# **The Auto Class - Loading Tokenizers & Models Manually**  

## **What's Covered?**
1. Introduction to Tokenizers
    - What are Tokenizers?
    - The Problem with Traditional Word-based Tokenizers
    - The Solution - Subword Tokenization
    - Common Tokenization Algorithms
    - Special Tokens
2. Introduction to Auto Classes
    - Loading Models & Tokenizers Manually
    - Why is it called "auto"?
    - Why use Auto classes?
    - Setting up a custom pipeline using Auto Classes
    - Preprocessing Data using AutoTokenizer (i.e. Tokenize the input)
    - Use the AutoModel to solve the Task (i.e. Pass the input to appropriate Model)
3. AutoTokenizer
    - Using .from_pretrained("model_name")
    - Configuration Files
    - What are input_ids, attention_mask and token_type_ids?
    - return_tensors Argument
    - Essential Configuration & Special Token Properties
4. Inspecting Tokenization Step by Step
    - Step 1: Split input_text to tokens
    - Step 2: Convert the tokens to numerical IDs
    - Step 3: Append special tokens the model expects
    - Step 4: Decoding input_ids back to words
    - Example: AutoTokenizer for "roberta-base" model
5. Batching, Padding and Truncation
    - Batching
    - Padding
    - Truncation
    - Fixing the Error: "Unable to create tensor"
    - Most Common Practice: Using truncation and padding with max_length Argument

## **Introduction to Tokenizers**

### **What are Tokenizers?**
At its core, a large language model (LLM) or any deep learning model understands numbers, not raw text. Tokenizers bridge this gap, converting human-readable text into a numerical format that models can process.

### **The Problem with Traditional Word-based Tokenizers**
1. **Vocabulary Size:** If we tried to make every unique word in the world a separate entry in a model's vocabulary, the vocabulary would be enormous (millions of words), making models inefficient and difficult to train.
2. **Out-Of-Vocabulary (OOV) Words:** What happens when the model encounters a word it has never seen during training? Traditional word-based tokenizers would map this to an "unknown" token, losing all semantic information.
3. **Morphology:** Words often share common roots or prefixes/suffixes (e.g., "running," "runs," "ran"). Word-level tokenization treats these as completely distinct, losing potential linguistic connections.
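The OOV problem can be seen in a toy word-level tokenizer. This is a minimal sketch with a made-up two-sentence corpus; the names `corpus`, `vocab` and `word_tokenize` are illustrative and not part of any library.

```python
# Toy word-level vocabulary built from a tiny "training" corpus (illustrative only).
corpus = ["the runner runs", "the dog ran"]

vocab = {"[UNK]": 0}
for sentence in corpus:
    for word in sentence.split():
        # Every distinct word gets its own ID, so the vocabulary grows with the corpus.
        vocab.setdefault(word, len(vocab))


def word_tokenize(text):
    """Map each whitespace-separated word to its ID, or to [UNK] if it was never seen."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]


print(vocab)
print(word_tokenize("the dog runs"))     # every word is known
print(word_tokenize("the dog running"))  # "running" is unseen and collapses to [UNK]
```

Note that "runs" and "running" share nothing in this scheme: one has its own ID, the other becomes `[UNK]`.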

### **The Solution - Subword Tokenization**
Modern tokenizers, especially with Transformer models, use subword tokenization. Instead of splitting text into full words, they break it down into smaller, meaningful units (subwords) that can be combined to form words. This offers a fantastic balance:
1. Smaller Vocabulary: Fewer unique subwords than unique words.
2. Handles OOV: Any new word can be broken down into known subwords (e.g., "unpredictable" could become "un" + "predict" + "able").
3. Morphological Information: Related words share common subwords.
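To make the idea concrete, here is a minimal sketch of splitting a word into known subwords with a greedy longest-match rule. The vocabulary `SUBWORDS` is hypothetical, and this simplified rule is only one possible strategy; real tokenizers learn their vocabulary from data.

```python
# Hypothetical subword vocabulary (a real one is learned from a large corpus).
SUBWORDS = {"un", "predict", "able", "run", "ning", "s"}


def subword_tokenize(word, vocab=SUBWORDS):
    """Greedy longest-match-first split of a single word into known subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is a known subword.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known subword starts at this position.
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces


print(subword_tokenize("unpredictable"))  # ['un', 'predict', 'able']
print(subword_tokenize("running"))        # ['run', 'ning']
print(subword_tokenize("runs"))           # ['run', 's']
```

Here "running" and "runs" share the subword "run", which is the morphological link that word-level tokenization loses.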

### **Common Tokenization Algorithms**

Let's look at the most common subword algorithms:
| Algorithm | Description | Example |
| ------ | ------ | ------ |
| **BPE (Byte Pair Encoding)** | Breaks rare words into common subwords | `huggingface → hug + ging + face` |
| **WordPiece** | Used in BERT; similar to BPE, adds special prefix (`##`) for subwords | `loving → lov + ##ing` |
| **SentencePiece** | Used in T5, ALBERT; language-agnostic, treats the input as a raw character stream and encodes spaces as `▁` | `▁I ▁love ▁AI` |
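As a rough illustration of how BPE builds its vocabulary, the sketch below learns merges from a tiny word-frequency table by repeatedly merging the most frequent adjacent pair of symbols. The function name `learn_bpe_merges` and the frequencies are made up for this example; production implementations add details such as byte-level handling and pre-tokenization.

```python
from collections import Counter


def learn_bpe_merges(word_freqs, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent symbol pair."""
    # Start with every word split into single characters.
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged_words = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_words[key] = merged_words.get(key, 0) + freq
        words = merged_words
    return merges, words


merges, words = learn_bpe_merges({"hug": 10, "pug": 5, "hugs": 5}, num_merges=2)
print(merges)  # [('u', 'g'), ('h', 'ug')]
print(words)   # {('hug',): 10, ('p', 'ug'): 5, ('hug', 's'): 5}
```

After two merges the frequent word "hug" is a single token, while the rarer "pug" is still split into `p` + `ug`.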

### **Special Tokens**

Tokenizers also introduce special tokens that models use for specific purposes:

1. `[CLS]` / `<s>` (Classification/Start): Often used at the beginning of a sequence. For classification tasks, the hidden state corresponding to this token is often used as the aggregated representation of the entire sequence.
2. `[SEP]` / `</s>` (Separator/End): Used to separate two sequences (e.g., in question answering, where you have a question and a context) or to mark the end of a single sequence.
3. `[PAD]` / `<pad>` (Padding): Used to fill shorter sequences to the same length as the longest sequence in a batch (more on this later).
4. `[UNK]` / `<unk>` (Unknown): A fallback token for characters or subwords not found in the vocabulary (rare with subword tokenization, but possible).
5. `[MASK]` / `<mask>` (Mask): Used in masked language modeling (e.g., BERT) where a token is randomly masked, and the model has to predict it.

## **Introduction to Auto Classes**

The Auto classes are Hugging Face's genius way of making it incredibly simple to load the correct tokenizer and model architecture for any pre-trained checkpoint from the Hub.

<img width="800" height="500" src="data/images/hugging_face_transformers_pipeline.jpeg">


### **Loading Models & Tokenizers Manually**
While **pipeline** is great for quick tasks, for more control (e.g., when you want to fine-tune a model, access intermediate layers, or customize generation parameters), you'll load the **model** and its corresponding **tokenizer** separately. This is where the **Auto classes** come in.

### **Why is it called "auto"?**
It's **"auto"** because it intelligently loads the correct tokenizer architecture and configuration for any pre-trained model you specify, ensuring compatibility.

### **Why use Auto classes?**
1. **Simplicity:** You don't need to know the exact class name (e.g., BertTokenizer, GPT2Model). Auto handles it.
2. **Interoperability:** Easily swap models by just changing the model_name string.
3. **Compatibility:** Ensures that the tokenizer and model are correctly matched for a given pre-trained checkpoint.

### **Setting up a custom pipeline using Auto Classes**

We need to perform the following two steps:
1. Preprocessing Data using AutoTokenizer (i.e. Tokenize the input)
2. Use the AutoModel to solve the Task (i.e. Pass the input to appropriate Model)

### **Preprocessing Data using `AutoTokenizer` (i.e. Tokenize the input)**

The main purpose of **AutoTokenizer** is to convert raw text into numerical representations (tokens) that a model can understand. This is achieved by:
1. **Splitting text:** Breaking down text into words or sub-word units.
2. **Converting to IDs:** Mapping those units to numerical IDs from the model's vocabulary.
3. **Adding special tokens:** Adding tokens like [CLS] (start of sequence) or [SEP] (separator) required by certain models.
4. **Padding and Truncation:** Making all input sequences the same length (padding) or cutting them down (truncation) to fit the model's maximum input size.

Before you can train a model on a dataset, it needs to be preprocessed into the expected model input format. Whether your data is text, images, or audio, they need to be converted and assembled into batches of tensors. 🤗 Transformers provides a set of preprocessing classes to help prepare your data for the model. Note that:
- **AutoTokenizer:** For text, use an `AutoTokenizer` to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
- **AutoFeatureExtractor:** For speech and audio, use an `AutoFeatureExtractor` to extract sequential features from audio waveforms and convert them into tensors.
- **AutoImageProcessor:** For image inputs, use an `AutoImageProcessor` to convert images into tensors.
- **AutoProcessor:** For multimodal inputs, use an `AutoProcessor` to combine a tokenizer and a feature extractor or image processor.

**Note: `AutoProcessor` always works and automatically chooses the correct class for the model you’re using, whether you’re using a tokenizer, image processor, feature extractor or processor.**


### **Use the `AutoModel` to solve the Task (i.e. Pass the input to appropriate Model)**

The purpose of **AutoModel** is to load the pre-trained model weights. 

After the data has been preprocessed and converted to tensors, we can use the pre-trained AutoModel classes to solve tasks in the following areas:
- [Natural Language Processing](https://huggingface.co/docs/transformers/model_doc/auto#natural-language-processing)
- [Computer Vision](https://huggingface.co/docs/transformers/model_doc/auto#computer-vision)
- [Audio](https://huggingface.co/docs/transformers/model_doc/auto#audio)
- [Multimodal](https://huggingface.co/docs/transformers/model_doc/auto#multimodal)

Similar to AutoTokenizer, it automatically infers the correct model architecture based on the model_name.

## **AutoTokenizer**

A tokenizer takes text as input and outputs numbers the associated model can make sense of.

Let's learn the step by step process now.

### **Using .from_pretrained("model_name")**

When you call `AutoTokenizer.from_pretrained("model_name")` the library does several things:

1. **Downloads Configuration:** It fetches the files associated with model_name from the Hugging Face Hub, starting with **tokenizer_config.json** (the tokenizer class and its settings) and, when needed, the model's **config.json**. The latter contains metadata about the model (e.g., model type, vocabulary size, number of layers, hidden dimension).
2. **Instantiates Class:** Based on these files, it instantiates the correct specific class (e.g., BertTokenizerFast, GPT2TokenizerFast) behind the scenes.
3. **Caches:** All downloaded files are cached locally on your system, so subsequent loads are much faster.
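The dispatch idea behind the steps above can be sketched as a small registry lookup. This is not the library's actual implementation; `ToyAutoTokenizer`, `ToyBertTokenizer`, `ToyGPT2Tokenizer` and `TOKENIZER_REGISTRY` are hypothetical names used only to illustrate how a configuration value can select a concrete class.

```python
class ToyBertTokenizer:
    """Stand-in for a BERT-style tokenizer class (hypothetical)."""


class ToyGPT2Tokenizer:
    """Stand-in for a GPT-2-style tokenizer class (hypothetical)."""


# Maps the "model_type" found in a configuration to a concrete class.
TOKENIZER_REGISTRY = {
    "bert": ToyBertTokenizer,
    "gpt2": ToyGPT2Tokenizer,
}


class ToyAutoTokenizer:
    """Picks the concrete tokenizer class from a configuration dictionary."""

    @classmethod
    def from_config(cls, config):
        model_type = config.get("model_type")
        if model_type not in TOKENIZER_REGISTRY:
            raise ValueError(f"Unrecognized model_type: {model_type!r}")
        return TOKENIZER_REGISTRY[model_type]()


# In the real library the configuration would come from the downloaded JSON files.
tok = ToyAutoTokenizer.from_config({"model_type": "bert"})
print(type(tok).__name__)  # ToyBertTokenizer
```

Swapping models then only requires a different configuration, which mirrors why changing the model_name string is enough when using the Auto classes.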

### **Configuration Files** 
- tokenizer_config.json
- config.json
- vocab.json
- merges.txt
- tokenizer.json

In [1]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

input_text = "Let's try to tokenize!"
print("Input Text:")
print(input_text)
print()

# Tokenize input
tokens = tokenizer(input_text)
print("Tokenized Output:")
print(tokens)
print()

# Decoding input_ids back to words
print("Decoded Text Output:")
print(tokenizer.decode(tokens["input_ids"]))



Input Text:
Let's try to tokenize!

Tokenized Output:
{'input_ids': [101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Decoded Text Output:
[CLS] let's try to tokenize! [SEP]


### **What are input_ids, attention_mask and token_type_ids?**

The tokenizer returns a dictionary with three important items:

- **input_ids:** These are integer IDs of tokens from the tokenizer’s vocabulary. Each ID maps to a word or subword. The input ids are often the **only required parameters to be passed to the model as input**. They are token indices, numerical representations of tokens building the sequences that will be used as input by the model.
- **attention_mask:** It tells the model which tokens to attend to (1) and which to ignore (0 for padding)
- **token_type_ids:** identifies which segment a token belongs to when there is more than one sequence.

**token_type_ids**  
- Some models (like BERT) are designed to take two distinct sequences as input for tasks such as Question Answering (Question + Context) or Next Sentence Prediction. token_type_ids distinguish between these two segments.
- token_type_ids helps the model understand which tokens belong to the first segment and which belong to the second.
- These require two different sequences to be joined in a single “input_ids” entry, which usually is performed with the help of special tokens, such as the classifier ([CLS]) and separator ([SEP]) tokens. For example, the BERT model builds its two-sequence input as such:
```python
# [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
```
- Values:
    - 0: For tokens belonging to the first sequence.
    - 1: For tokens belonging to the second sequence.
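The BERT-style layout and the 0/1 segment values can be reproduced by hand. This is a minimal sketch that works on already-split tokens; `build_pair_inputs` is a hypothetical helper, not a library function.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Join two token lists BERT-style and build the matching token_type_ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sequence A and the first [SEP];
    # segment 1 covers sequence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


pair_tokens, pair_type_ids = build_pair_inputs(
    ["it", "is", "in", "nyc"],
    ["where", "is", "it", "?"],
)
for token, type_id in zip(pair_tokens, pair_type_ids):
    print(f"{token:>6} -> {type_id}")
```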

We can use AutoTokenizer to automatically generate such an input by passing the two sequences to the tokenizer as two separate arguments (not as a list), like this:

In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

sequence_a = "HuggingFace is based in NYC"

sequence_b = "Where is HuggingFace based?"

encoded_dict = tokenizer(sequence_a, sequence_b)

decoded = tokenizer.decode(encoded_dict["input_ids"])

print("token_type_ids:")
print(encoded_dict["token_type_ids"])
print()

# Decoding input_ids back to words
print("Decoded Text Output:")
print(decoded)

token_type_ids:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Decoded Text Output:
[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]




### **return_tensors Argument**
Finally, you want the tokenizer to return the actual tensors that get fed to the model.

In order to do that, we can set the `return_tensors` parameter to **'pt' for PyTorch**, **'tf' for TensorFlow**, or **'np' for NumPy**.  

`return_tensors`: Acceptable values are:
- 'tf': Return TensorFlow tf.constant objects.
- 'pt': Return PyTorch torch.Tensor objects.
- 'np': Return Numpy np.ndarray objects.

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

input_text = "Let's try to tokenize!"
print("Input Text:")
print(input_text)

Input Text:
Let's try to tokenize!


In [4]:
encoded_input = tokenizer(input_text)

print("Input Ids")
print(encoded_input["input_ids"])
print()

print("Output Type:")
print(type(encoded_input["input_ids"]))

Input Ids
[101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102]

Output Type:
<class 'list'>


In [5]:
encoded_input = tokenizer(input_text, return_tensors="np")

print("Input Ids")
print(encoded_input["input_ids"])
print()

print("Output Type:")
print(type(encoded_input["input_ids"]))

Input Ids
[[  101  2292  1005  1055  3046  2000 19204  4697   999   102]]

Output Type:
<class 'numpy.ndarray'>


In [6]:
encoded_input = tokenizer(input_text, return_tensors="pt")

print("Input Ids")
print(encoded_input["input_ids"])
print()

print("Output Type:")
print(type(encoded_input["input_ids"]))

Input Ids
tensor([[  101,  2292,  1005,  1055,  3046,  2000, 19204,  4697,   999,   102]])

Output Type:
<class 'torch.Tensor'>


In [7]:
encoded_input = tokenizer(input_text, return_tensors="tf")

print("Input Ids")
print(encoded_input["input_ids"])
print()

print("Output Type:")
print(type(encoded_input["input_ids"]))

Input Ids
tf.Tensor([[  101  2292  1005  1055  3046  2000 19204  4697   999   102]], shape=(1, 10), dtype=int32)

Output Type:
<class 'tensorflow.python.framework.ops.EagerTensor'>


### **Essential Configuration & Special Token Properties**

- **tokenizer.model_max_length:** The maximum length (in number of tokens) that the associated model is designed to handle. When loading a tokenizer with from_pretrained(), this value usually comes from the tokenizer's configuration and reflects the model's positional limit. If no specific value is found, it defaults to a very large integer (`int(1e30)`), indicating the tokenizer itself doesn't impose a hard limit, but the model still will.
- **tokenizer.vocab_size:** The size of the tokenizer's base vocabulary, i.e. the subword tokens and built-in special tokens it was trained with. Tokens added afterwards are not counted here; `len(tokenizer)` includes them.
- **tokenizer.is_fast:** A boolean indicating if this is a Rust-backed "Fast" tokenizer (which are generally recommended for speed and extra features like offset mapping).
- **tokenizer.padding_side:** Indicates whether padding should be applied to the 'right' (default for most encoder models like BERT) or 'left' (often preferred for decoder-only models like GPT-2 during batched generation, so that newly generated tokens directly follow the real tokens rather than the padding).
- **tokenizer.truncation_side:** Indicates whether truncation should happen from the 'right' (default, cutting off the end) or 'left' (cutting off the beginning).
- **tokenizer.model_input_names:** A list of the expected input names for the model's forward pass (e.g., ['input_ids', 'attention_mask', 'token_type_ids']). This is useful for understanding what keys the tokenizer() method will return.

**Special Token Properties (and their IDs):**
- Tokenizers automatically add special tokens (like [CLS], [SEP], [PAD]) during tokenization. These properties store the string representation of these tokens and their corresponding numerical IDs in the vocabulary.
- **tokenizer.unk_token** and **tokenizer.unk_token_id:** Unknown token (for OOV words).
- **tokenizer.pad_token** and **tokenizer.pad_token_id:** Padding token.
- **tokenizer.cls_token** and **tokenizer.cls_token_id:** Classification token (e.g., BERT's start token).
- **tokenizer.sep_token** and **tokenizer.sep_token_id:** Separator token.
- **tokenizer.mask_token** and **tokenizer.mask_token_id:** Mask token (for masked language modeling).
- **tokenizer.bos_token** and **tokenizer.bos_token_id:** Beginning of sentence token.
- **tokenizer.eos_token** and **tokenizer.eos_token_id:** End of sentence token.
- **tokenizer.additional_special_tokens** and **tokenizer.additional_special_tokens_ids:** Any other special tokens added.

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(f"Model Max Length: {tokenizer.model_max_length}")
print(f"Vocabulary Size: {tokenizer.vocab_size}")
print(f"Is Fast Tokenizer: {tokenizer.is_fast}")
print(f"Padding Side: {tokenizer.padding_side}")
print(f"Truncation Side: {tokenizer.truncation_side}")
print(f"Model Input Names: {tokenizer.model_input_names}")

print("\n--- Special Tokens ---")
print(f"[CLS] Token: '{tokenizer.cls_token}' (ID: {tokenizer.cls_token_id})")
print(f"[SEP] Token: '{tokenizer.sep_token}' (ID: {tokenizer.sep_token_id})")
print(f"[PAD] Token: '{tokenizer.pad_token}' (ID: {tokenizer.pad_token_id})")
print(f"[UNK] Token: '{tokenizer.unk_token}' (ID: {tokenizer.unk_token_id})")
print(f"[MASK] Token: '{tokenizer.mask_token}' (ID: {tokenizer.mask_token_id})")

Model Max Length: 512
Vocabulary Size: 30522
Is Fast Tokenizer: True
Padding Side: right
Truncation Side: right
Model Input Names: ['input_ids', 'token_type_ids', 'attention_mask']

--- Special Tokens ---
[CLS] Token: '[CLS]' (ID: 101)
[SEP] Token: '[SEP]' (ID: 102)
[PAD] Token: '[PAD]' (ID: 0)
[UNK] Token: '[UNK]' (ID: 100)
[MASK] Token: '[MASK]' (ID: 103)


In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

print(f"Model Max Length: {tokenizer.model_max_length}")
print(f"Vocabulary Size: {tokenizer.vocab_size}")
print(f"Is Fast Tokenizer: {tokenizer.is_fast}")
print(f"Padding Side: {tokenizer.padding_side}")
print(f"Truncation Side: {tokenizer.truncation_side}")
print(f"Model Input Names: {tokenizer.model_input_names}")

print("\n--- Special Tokens ---")
print(f"[CLS] Token: '{tokenizer.cls_token}' (ID: {tokenizer.cls_token_id})")
print(f"[SEP] Token: '{tokenizer.sep_token}' (ID: {tokenizer.sep_token_id})")
print(f"[PAD] Token: '{tokenizer.pad_token}' (ID: {tokenizer.pad_token_id})")
print(f"[UNK] Token: '{tokenizer.unk_token}' (ID: {tokenizer.unk_token_id})")
print(f"[MASK] Token: '{tokenizer.mask_token}' (ID: {tokenizer.mask_token_id})")

Model Max Length: 512
Vocabulary Size: 50265
Is Fast Tokenizer: True
Padding Side: right
Truncation Side: right
Model Input Names: ['input_ids', 'attention_mask']

--- Special Tokens ---
[CLS] Token: '<s>' (ID: 0)
[SEP] Token: '</s>' (ID: 2)
[PAD] Token: '<pad>' (ID: 1)
[UNK] Token: '<unk>' (ID: 3)
[MASK] Token: '<mask>' (ID: 50264)


## **Inspecting Tokenization Step by Step**
<img style="float: right;" width="400" height="400" src="data/images/tokenization.JPG">

- **Step 1: Split input_text to tokens**
    - Create **tokens** using `tokenizer.tokenize(input_text)`, which splits input_text into tokens.
- **Step 2: Convert the tokens to numerical IDs**
    - Use `tokenizer.convert_tokens_to_ids(tokens)` to convert tokens to integer IDs. Each ID maps to a word or subword.
- **Step 3: Append special tokens the model expects**
    - Append special tokens the model expects using `tokenizer.prepare_for_model(input_ids)`.
- **Step 4: Decoding input_ids back to words**
    - Decode the final output using `tokenizer.decode(input_ids_with_special_tokens)`.
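The four steps can be traced without any library using a toy WordPiece-style tokenizer. This is a sketch under stated assumptions: the 12-entry `VOCAB` and its IDs are made up (they are not BERT's real IDs), and the functions only imitate the behavior of the corresponding tokenizer methods.

```python
import re

# Hypothetical miniature vocabulary; the IDs are NOT those of bert-base-uncased.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "let": 4, "'": 5,
         "s": 6, "try": 7, "to": 8, "token": 9, "##ize": 10, "!": 11}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}


def tokenize(text):
    """Step 1: lowercase, split words/punctuation, then greedy WordPiece-style split."""
    tokens = []
    for word in re.findall(r"\w+|[^\w\s]", text.lower()):
        pieces, start = [], 0
        while start < len(word):
            end, piece = len(word), None
            while end > start:
                # Continuation pieces carry the "##" prefix.
                candidate = word[start:end] if start == 0 else "##" + word[start:end]
                if candidate in VOCAB:
                    piece = candidate
                    break
                end -= 1
            if piece is None:
                pieces = ["[UNK]"]
                break
            pieces.append(piece)
            start = end
        tokens.extend(pieces)
    return tokens


def convert_tokens_to_ids(tokens):
    """Step 2: look up each token in the vocabulary."""
    return [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens]


def add_special_tokens(ids):
    """Step 3: wrap the sequence in [CLS] ... [SEP]."""
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]


def decode(ids):
    """Step 4: map IDs back to tokens and glue '##' pieces to the previous token."""
    return " ".join(ID_TO_TOKEN[i] for i in ids).replace(" ##", "")


toy_tokens = tokenize("Let's try to tokenize!")
toy_ids = convert_tokens_to_ids(toy_tokens)
toy_ids_with_special = add_special_tokens(toy_ids)
print(toy_tokens)
print(toy_ids)
print(toy_ids_with_special)
print(decode(toy_ids_with_special))
```

Unlike the real `tokenizer.decode`, this toy `decode` does not clean up the spaces around punctuation.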

### **Step 1: Split input_text to tokens**

**Key Point**  
- The ## prefix indicates a subword piece that belongs to the previous token.

In [10]:
## Step 1: Split input_text to tokens
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

print(f"Tokenizer's default max_length (model_max_length): {tokenizer.model_max_length}")
print()

input_text = "Let's try to tokenize!"
print("Input Text:", input_text)
print()

## The first step of the above pipeline is to split the text into tokens
tokens = tokenizer.tokenize(input_text)
print("Tokens:", tokens)

Tokenizer's default max_length (model_max_length): 512

Input Text: Let's try to tokenize!

Tokens: ['let', "'", 's', 'try', 'to', 'token', '##ize', '!']


### **Step 2: Convert the tokens to numerical IDs**

In [11]:
## Step 2: Convert the tokens to numerical IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens Id:", input_ids)

Tokens Id: [2292, 1005, 1055, 3046, 2000, 19204, 4697, 999]


### **Step 3: Append special tokens the model expects**

In [12]:
## Step 3: Lastly, the tokenizer adds special tokens the model expects
out = tokenizer.prepare_for_model(input_ids)
input_ids_with_special_tokens = out["input_ids"]
print("Tokens Id with special tokens:", input_ids_with_special_tokens)

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Tokens Id with special tokens: [101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102]


### **Step 4: Decoding input_ids back to words**

In [13]:
## Step 4: Decode method allows us to check how the final output of the 
## tokenizer translates back to text
print("Decoded Text Output:", tokenizer.decode(input_ids_with_special_tokens))

Decoded Text Output: [CLS] let's try to tokenize! [SEP]


### **Example: AutoTokenizer for "roberta-base" model**

In [14]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

print(f"Tokenizer's default max_length (model_max_length): {tokenizer.model_max_length}")
print()

input_text = "Let's try to tokenize!"
print("Input Text:", input_text)
print()

## The first step of the above pipeline is to split the text into tokens
tokens = tokenizer.tokenize(input_text)
print("Tokens:", tokens)
print()

## Convert the tokens to unique numerical number
input_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens Id:", input_ids)
print()

## Lastly, the tokenizer adds special tokens the model expects
final_inputs = tokenizer.prepare_for_model(input_ids)
print("Tokens Id with special tokens:", final_inputs["input_ids"])
print()

## Decode method allows us to check how the final output of the 
## tokenizer translates back to text
print("Decoded Text Output:", tokenizer.decode(final_inputs["input_ids"]))

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Tokenizer's default max_length (model_max_length): 512

Input Text: Let's try to tokenize!

Tokens: ['Let', "'s", 'Ġtry', 'Ġto', 'Ġtoken', 'ize', '!']

Tokens Id: [7939, 18, 860, 7, 19233, 2072, 328]

Tokens Id with special tokens: [0, 7939, 18, 860, 7, 19233, 2072, 328, 2]

Decoded Text Output: <s>Let's try to tokenize!</s>


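**Important Note:** The `Ġ` prefix marks a token that was preceded by a space in the original text, i.e. the start of a new word. Tokens without it (such as `ize`) continue the previous token.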
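A simplified way to see the role of `Ġ` is to rebuild the text from the tokens printed above. The helper `join_byte_level_tokens` is hypothetical and handles only the space marker; the real byte-level BPE decoder maps every byte back, not just spaces.

```python
def join_byte_level_tokens(tokens):
    """Concatenate tokens and turn each 'Ġ' marker back into a space."""
    return "".join(tokens).replace("Ġ", " ")


roberta_tokens = ["Let", "'s", "Ġtry", "Ġto", "Ġtoken", "ize", "!"]
print(join_byte_level_tokens(roberta_tokens))  # Let's try to tokenize!
```

Compare this with BERT's WordPiece, which marks the opposite case: `##` flags a continuation piece, whereas `Ġ` flags a word start.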

## **Batching, Padding and Truncation**

These concepts are critical for preparing data efficiently for deep learning models, especially when dealing with variable-length sequences like text.

Reference: https://huggingface.co/docs/transformers/en/pad_truncation

### **Batching**
- Deep learning models, particularly when running on GPUs, are highly optimized to process data in batches (collections of multiple input samples) rather than one sample at a time.
- This improves efficiency and speed of computation.
    - Efficiency: GPUs perform parallel computations very well. Processing multiple samples at once keeps the GPU busy and utilizes its power effectively.
    - Speed: Reduces overhead from repeatedly transferring data between CPU and GPU.
- Batched inputs are often different lengths, so they can’t be converted to fixed-size tensors. **Padding and truncation are strategies for** dealing with this problem, to create rectangular tensors from batches of varying lengths. Padding adds a **special padding token** to ensure shorter sequences will have the same length as either the longest sequence in a batch or the maximum length accepted by the model. Truncation works in the other direction by **truncating long sequences**.
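Batching itself is just grouping samples. A minimal sketch, assuming a plain Python list of samples and a hypothetical helper `make_batches`:

```python
def make_batches(samples, batch_size):
    """Split a list of samples into consecutive batches of at most batch_size items."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


example_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
    "This is a much longer sentence that will definitely need to be truncated.",
    "Another sentence.",
]
for batch in make_batches(example_sentences, batch_size=2):
    print(len(batch), batch)
```

The last batch may be smaller than `batch_size`, and the sentences inside each batch still differ in length, which is the problem padding and truncation address.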

In [15]:
batched_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
    "This is a much longer sentence that will definitely need to be truncated.",
    "Another sentence."
]

batched_sentences

['But what about second breakfast?',
 "Don't think he knows about second breakfast, Pip.",
 'What about elevensies?',
 'This is a much longer sentence that will definitely need to be truncated.',
 'Another sentence.']

### **Padding**
- When you batch multiple sequences of text, they almost always have different lengths. 
- However, neural networks require fixed-size input tensors. 
- Padding involves adding special [PAD] tokens to the shorter sequences in a batch to make them all the same length as the longest sequence in that batch.
- The `padding` argument controls padding. It can be a boolean or a string:
    - `padding=True` or `padding='longest'`: Pads all sequences in the batch to the length of the longest sequence.
    - `padding='max_length'`: Pads all sequences to the length given by the `max_length` argument (e.g., 512, which is common for BERT), or to the model's maximum length if `max_length` is not set.
    - `padding=False` or `padding='do_not_pad'`: No padding is applied. This is the default behavior.
- Crucial Role of attention_mask: The attention_mask becomes essential here. It tells the model to ignore the padded tokens so they don't influence the model's computations or predictions.
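What padding to the longest sequence does, and how the attention mask follows from it, can be sketched in plain Python. `pad_batch` is a hypothetical helper that works on lists of token IDs; the IDs below are taken from the BERT outputs shown in this notebook, where `[PAD]` has ID 0.

```python
def pad_batch(batch_ids, pad_token_id=0, padding_side="right"):
    """Pad every sequence to the longest one and build the matching attention mask."""
    target = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = target - len(ids)
        pads, real, ignored = [pad_token_id] * n_pad, [1] * len(ids), [0] * n_pad
        if padding_side == "right":
            input_ids.append(list(ids) + pads)
            attention_mask.append(real + ignored)
        else:
            input_ids.append(pads + list(ids))
            attention_mask.append(ignored + real)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch([
    [101, 2178, 6251, 1012, 102],              # "Another sentence."
    [101, 2054, 2055, 5408, 14625, 1029, 102]  # "What about elevensies?"
])
print(padded["input_ids"])
print(padded["attention_mask"])
```

Every 0 in `attention_mask` lines up with a padding ID, which is how the model knows which positions to ignore.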

In [16]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(f"Tokenizer's default max_length (model_max_length): {tokenizer.model_max_length}") 
# Often 512 or 1e30 (meaning no hard limit set by tokenizer itself)

Tokenizer's default max_length (model_max_length): 512


In [17]:
# Pad shorter sequences to the length of the longest in the batch
tokenized_inputs = tokenizer(batched_sentences, padding=True, return_tensors="pt")

# Let's check the input ids
print("\n--- Processed Inputs (Batch) ---")
print(f"Input IDs:\n{tokenized_inputs['input_ids']}")
print(f"Shape of Input IDs: {tokenized_inputs['input_ids'].shape}") # (batch_size, length of the longest sequence in the batch)

# Let's decode each sequence to see the effect
print("\n--- Decoded Sequences ---")
for i, input_ids in enumerate(tokenized_inputs['input_ids']):
    decoded_text = tokenizer.decode(input_ids)
    print(f"Sequence {i+1}: {decoded_text}")

print("\n--- Understanding the shapes ---")
print(f"Batch Size (Number of sentences): {tokenized_inputs['input_ids'].shape[0]}")
print(f"Sequence Length (after padding/truncation): {tokenized_inputs['input_ids'].shape[1]}")


--- Processed Inputs (Batch) ---
Input IDs:
tensor([[  101,  2021,  2054,  2055,  2117,  6350,  1029,   102,     0,     0,
             0,     0,     0,     0,     0,     0],
        [  101,  2123,  1005,  1056,  2228,  2002,  4282,  2055,  2117,  6350,
          1010, 28315,  1012,   102,     0,     0],
        [  101,  2054,  2055,  5408, 14625,  1029,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0],
        [  101,  2023,  2003,  1037,  2172,  2936,  6251,  2008,  2097,  5791,
          2342,  2000,  2022, 25449,  1012,   102],
        [  101,  2178,  6251,  1012,   102,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0]])
Shape of Input IDs: torch.Size([5, 16])

--- Decoded Sequences ---
Sequence 1: [CLS] but what about second breakfast? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
Sequence 2: [CLS] don't think he knows about second breakfast, pip. [SEP] [PAD] [PAD]
Sequence 3: [CLS] what about elevensies? 

### **Truncation**
- On the other end of the spectrum, sometimes a sequence may be too long for a model to handle. In this case, you’ll need to truncate the sequence to a shorter length.
- Models have a maximum input length they can handle (e.g., 512 tokens for bert-base-uncased).
- If a sequence is longer than this maximum length, truncation involves cutting off the excess tokens from the end of the sequence.
- The `truncation` argument controls truncation. It can be a boolean or a string:
    - `truncation=True` or `truncation='longest_first'`: Truncates sequences that exceed the maximum length. If `max_length` is not specified, it defaults to the tokenizer's `model_max_length`. For a pair of sequences, tokens are removed from the longer sequence first.
    - `truncation='only_first'` or `truncation='only_second'`: For a pair of sequences, truncates only the first or only the second sequence. The target length is set with the separate `max_length` argument; note that `'max_length'` is a value of `padding`, not of `truncation`.
    - `truncation=False` or `truncation='do_not_truncate'`: No truncation is applied. This is the default behavior.

In real scenarios, you'd often use `model.config.max_position_embeddings` to get the model's true maximum capacity.
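The effect of truncation on a single sequence can be sketched as follows. `truncate_with_specials` is a hypothetical helper; it assumes BERT's `[CLS]`/`[SEP]` IDs (101 and 102) and counts both special tokens toward `max_length`, so they survive the cut.

```python
def truncate_with_specials(content_ids, max_length, cls_id=101, sep_id=102,
                           truncation_side="right"):
    """Cut the content so that [CLS] + content + [SEP] fits into max_length tokens."""
    budget = max_length - 2  # two positions are reserved for [CLS] and [SEP]
    if budget < 0:
        raise ValueError("max_length must leave room for the two special tokens")
    content_ids = list(content_ids)
    if len(content_ids) > budget:
        if truncation_side == "right":
            content_ids = content_ids[:budget]                      # drop the end
        else:
            content_ids = content_ids[len(content_ids) - budget:]   # drop the beginning
    return [cls_id] + content_ids + [sep_id]


# Content IDs of the long example sentence, without its special tokens.
long_ids = [2023, 2003, 1037, 2172, 2936, 6251, 2008, 2097, 5791, 2342,
            2000, 2022, 25449, 1012]
print(truncate_with_specials(long_ids, max_length=8))
print(truncate_with_specials(long_ids, max_length=8, truncation_side="left"))
```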

In [18]:
# Truncation only, no padding: the sequences keep different lengths, so building a tensor fails
try:
    tokenized_inputs = tokenizer(batched_sentences, truncation=True, return_tensors="pt")
except Exception as e:
    print(f"\nAn error occurred: {e}")


An error occurred: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).


### **Fixing the Error: "Unable to create tensor"**

When batched_sentences is a list of multiple sentences, they will almost certainly have different lengths. For the tokenizer to combine them into a single, uniform tensor (a batch), all sequences must be of the same length.
- `truncation=True` makes sure no sentence is too long.
- `padding=True` makes sure no sentence is too short (it adds padding tokens).

Without **padding=True**, if your sentences have different lengths, the tokenizer cannot create a single rectangular PyTorch tensor for the batch.

In [19]:
tokenized_inputs = tokenizer(batched_sentences, 
                             truncation=True, 
                             padding=True, 
                             return_tensors="pt")

# Let's check the input ids
print("\n--- Processed Inputs (Batch) ---")
print(f"Input IDs:\n{tokenized_inputs['input_ids']}")
print(f"Shape of Input IDs: {tokenized_inputs['input_ids'].shape}") # (batch_size, length of the longest sequence in the batch)

# Let's decode each sequence to see the effect
print("\n--- Decoded Sequences ---")
for i, input_ids in enumerate(tokenized_inputs['input_ids']):
    decoded_text = tokenizer.decode(input_ids)
    print(f"Sequence {i+1}: {decoded_text}")

print("\n--- Understanding the shapes ---")
print(f"Batch Size (Number of sentences): {tokenized_inputs['input_ids'].shape[0]}")
print(f"Sequence Length (after padding/truncation): {tokenized_inputs['input_ids'].shape[1]}")


--- Processed Inputs (Batch) ---
Input IDs:
tensor([[  101,  2021,  2054,  2055,  2117,  6350,  1029,   102,     0,     0,
             0,     0,     0,     0,     0,     0],
        [  101,  2123,  1005,  1056,  2228,  2002,  4282,  2055,  2117,  6350,
          1010, 28315,  1012,   102,     0,     0],
        [  101,  2054,  2055,  5408, 14625,  1029,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0],
        [  101,  2023,  2003,  1037,  2172,  2936,  6251,  2008,  2097,  5791,
          2342,  2000,  2022, 25449,  1012,   102],
        [  101,  2178,  6251,  1012,   102,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0]])
Shape of Input IDs: torch.Size([5, 16])

--- Decoded Sequences ---
Sequence 1: [CLS] but what about second breakfast? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
Sequence 2: [CLS] don't think he knows about second breakfast, pip. [SEP] [PAD] [PAD]
Sequence 3: [CLS] what about elevensies? 

### **Most Common Practice: Using truncation and padding with max_length Argument**
- The `max_length` argument in the Hugging Face `tokenizer()` method is a crucial parameter that defines the maximum sequence length for both padding and truncation.
- The `max_length` argument works in conjunction with both `padding` and `truncation` arguments to prepare your text data for the model.
- padding='max_length' + max_length specified:
    - If a sequence is shorter than the max_length value, the tokenizer adds [PAD] tokens to the end of that sequence until it reaches the specified max_length.
    - This ensures all sequences in your batch have the same length, which is a requirement for creating uniform tensors that can be fed into a deep learning model.
    - Note that padding=True (equivalent to 'longest') pads only to the longest sequence in the batch; max_length then matters only for truncation.
- truncation=True + max_length specified:
    - If a sequence is longer than the max_length value, the tokenizer will cut off (truncate) the excess tokens from the end of that sequence until it fits the specified max_length.
    - This is essential because models have an inherent maximum context window or max_position_embeddings (e.g., 512 for bert-base-uncased). Passing a sequence longer than this limit would result in an error or undefined behavior.

In most cases, padding your batch to the length of the longest sequence and truncating to the maximum length a model can accept works well. However, the API supports more strategies if you need them. The three arguments you need to know are `padding`, `truncation` and `max_length`.
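To make the interaction between the three arguments concrete, here is a minimal pure-Python sketch of the logic on lists of token IDs. `pad_and_truncate` is a hypothetical helper written for illustration, not part of the transformers API. It assumes truncation is always on when `max_length` is given, uses `0` as the [PAD] id (as in the bert-base-uncased outputs above), and keeps the final [SEP] when cutting.

```python
PAD_ID = 0  # id of [PAD] in the bert-base-uncased outputs shown above


def pad_and_truncate(batch, padding="longest", max_length=None):
    """Hypothetical helper: return (input_ids, attention_mask) as lists of lists."""
    # Truncation: shorten sequences longer than max_length, keeping the
    # final token ([SEP]) so the sequence still ends with it.
    if max_length is not None:
        batch = [
            ids if len(ids) <= max_length else ids[: max_length - 1] + [ids[-1]]
            for ids in batch
        ]

    # Padding target: a fixed length, or the longest sequence in this batch.
    if padding == "max_length":
        if max_length is None:
            raise ValueError("padding='max_length' requires max_length")
        target = max_length
    else:  # "longest"
        target = max(len(ids) for ids in batch)

    input_ids = [ids + [PAD_ID] * (target - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in batch]
    return input_ids, attention_mask


batch = [
    [101, 2178, 6251, 1012, 102],                    # "Another sentence."
    [101, 2021, 2054, 2055, 2117, 6350, 1029, 102],  # "But what about second breakfast?"
]

longest_ids, longest_mask = pad_and_truncate(batch, padding="longest")
fixed_ids, fixed_mask = pad_and_truncate(batch, padding="max_length", max_length=6)

print(longest_ids)  # both rows have length 8
print(fixed_ids)    # both rows have length 6
print(fixed_mask)
```

With `padding="longest"` the batch width follows the data, while `padding="max_length"` gives every batch the same fixed width, which is the behavior the next cell demonstrates with the real tokenizer.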


In [20]:
from transformers import AutoTokenizer

# We'll use a common BERT tokenizer for demonstration
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

texts = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
    "This is a much longer sentence that will definitely need to be truncated.",
    "Another sentence."
]

print(f"Tokenizer's default max_length (model_max_length): {tokenizer.model_max_length}") 

# --- Padding and Truncation to a custom max_length (most common for fixed input size) ---
print("\n--- Padding & Truncation to max_length=20 (most common) ---")
inputs_combined = tokenizer(
    texts,
    padding='max_length', # Pad all sequences to max_length
    truncation=True,      # Truncate sequences longer than max_length
    max_length=20,        # The target fixed length
    return_tensors="pt"
)
print("\nInput IDs (combined - all same length):")
for i, ids in enumerate(inputs_combined['input_ids']):
    print(f"Seq {i+1} Length: {len(ids)}, IDs: {ids}")
    print(f"Decoded: {tokenizer.decode(ids, skip_special_tokens=False)}") # Show PAD tokens
print(f"\nFinal tensor shape: {inputs_combined['input_ids'].shape}") # All sequences are now (batch_size, 20)

Tokenizer's default max_length (model_max_length): 512

--- Padding & Truncation to max_length=20 (most common) ---

Input IDs (combined - all same length):
Seq 1 Length: 20, IDs: tensor([ 101, 2021, 2054, 2055, 2117, 6350, 1029,  102,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0])
Decoded: [CLS] but what about second breakfast? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
Seq 2 Length: 20, IDs: tensor([  101,  2123,  1005,  1056,  2228,  2002,  4282,  2055,  2117,  6350,
         1010, 28315,  1012,   102,     0,     0,     0,     0,     0,     0])
Decoded: [CLS] don't think he knows about second breakfast, pip. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
Seq 3 Length: 20, IDs: tensor([  101,  2054,  2055,  5408, 14625,  1029,   102,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0])
Decoded: [CLS] what about elevensies? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD

## **AutoModel - The Universal Model Loader**

Just as AutoTokenizer simplifies text preprocessing, AutoModel is your gateway to interacting with the pre-trained brain of a Transformer network.

AutoModel is a class from the transformers library designed to automatically load the correct pre-trained model architecture and its weights based on a given checkpoint name (e.g., "bert-base-uncased", "gpt2", "facebook/bart-large").

It loads:
- The weights
- The architecture class (like BertModel)
- The configuration (like hidden_size, num_layers, etc.)

Note, however, that AutoModel gives you the raw model without any task-specific head.


### **Using .from_pretrained("model_name")**

When you call AutoModel.from_pretrained("model_name"), the library does several things:
1. **Configuration Download (`config.json`):** The library first looks for and downloads the `config.json` file for that model_name from the Hugging Face Hub. This JSON file contains the architectural blueprint and hyperparameters of the model (e.g., number of layers, hidden dimension, attention heads, vocabulary size, the task it's meant for).
2. **Class Instantiation:** Based on the model_type specified in config.json (e.g., "bert"), AutoModel dynamically determines the correct Python class to instantiate (e.g., transformers.BertModel).
3. **Weight Download:** Once the architecture is known, it downloads the actual pre-trained model weights (usually large binary files like **`model.safetensors`**, or **`pytorch_model.bin`** / **`tf_model.h5`** for older checkpoints). These weights contain the knowledge the model acquired during its extensive pre-training on vast amounts of data.
4. **Local Caching:** All downloaded files (config, weights) are stored in your local Hugging Face cache directory (usually `~/.cache/huggingface/hub` in recent versions of the library; older versions used `~/.cache/huggingface/transformers`), so subsequent loads of the same model are much faster.
5. **Model Loading:** The model instance is created, and the downloaded weights are loaded into its layers.
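The class-selection step (step 2) can be pictured as a dictionary lookup keyed on `model_type`. The sketch below is a toy illustration only: `MODEL_REGISTRY` and `resolve_model_class` are hypothetical names, the registry holds just three entries, and the real library's internal mapping is far larger and implemented differently.

```python
import json

# Hypothetical, heavily simplified registry: model_type -> class name.
MODEL_REGISTRY = {
    "bert": "BertModel",
    "roberta": "RobertaModel",
    "gpt2": "GPT2Model",
}


def resolve_model_class(config_json: str) -> str:
    """Read model_type from a config.json string and look up the class name."""
    config = json.loads(config_json)
    model_type = config["model_type"]
    if model_type not in MODEL_REGISTRY:
        raise ValueError(f"Unknown model_type: {model_type!r}")
    return MODEL_REGISTRY[model_type]


# A trimmed-down stand-in for the config.json of a BERT checkpoint.
example_config = '{"model_type": "bert", "hidden_size": 768, "num_hidden_layers": 12}'
resolved = resolve_model_class(example_config)
print(resolved)  # BertModel
```

This is why the same `AutoModel.from_pretrained(...)` call works across checkpoints: the checkpoint's own configuration names the architecture.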

### **Configuration Files**
- config.json: the architecture and hyperparameters
- model.safetensors: the pre-trained weights

In [40]:
from transformers import AutoModel, AutoTokenizer

model_checkpoint = "google-bert/bert-base-uncased"

model = AutoModel.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)



### **What are these warnings?**
- These warnings appear while the checkpoint's weights are loaded into the PyTorch model, typically when a model was converted from a TensorFlow-style checkpoint (e.g., from .ckpt or .h5), or during custom loading of model weights where the layer names do not fully align with PyTorch naming conventions.
- In TensorFlow / Keras, the parameters of Batch Normalization or Layer Normalization layers are commonly named:
    - gamma (the scale), which maps to weight in PyTorch
    - beta (the offset), which maps to bias in PyTorch
- When Hugging Face loads these weights into a PyTorch model, it renames them internally for compatibility.
- No need to worry about these warnings. Your model will still load and run correctly. It won't affect performance, outputs, or fine-tuning.


In [41]:
inputs = tokenizer("What will be the output of model?", return_tensors="pt")
outputs = model(**inputs)
print(outputs.keys())

odict_keys(['last_hidden_state', 'pooler_output'])


### **What is last_hidden_state and pooler_output?**

**last_hidden_state (Embeddings for each token)**
- It's the final output from the last encoder layer of the transformer. It contains token-level contextual embeddings.
- Tensor of shape **(batch_size, seq_len, hidden_size)**
- Used for token-level tasks such as NER, for inspecting per-token representations, and as the basis for sentence embeddings
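One common way to turn last_hidden_state into a sentence embedding is to average the token vectors while ignoring padding positions. The sketch below shows the idea with plain Python lists and a toy hidden size of 2. `mean_pool` is a hypothetical helper, and mean pooling is only one of several pooling strategies, not something the source prescribes.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), skipping padding."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    hidden_size = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(hidden_size)]


# One sentence, three positions, toy hidden_size of 2.
token_vectors = [
    [1.0, 2.0],  # real token
    [3.0, 4.0],  # real token
    [9.0, 9.0],  # [PAD] position, must not influence the average
]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [2.0, 3.0]
```

In a real pipeline the same computation would run on tensors of shape (batch_size, seq_len, hidden_size), with the attention_mask returned by the tokenizer.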

**pooler_output (Sentence-Level Embedding)**
- It is a summary representation of the entire sentence, derived from the [CLS] token.
- Specifically:
    - Take the embedding of [CLS] from last_hidden_state
    - Pass it through a dense (Linear) layer + Tanh activation
- This gives you a fixed-size sentence vector.
- Tensor of shape **(batch_size, hidden_size)**
- Some models, such as DistilBERT, do not return pooler_output at all.

**Example**  

Let's say input is: "I love GenAI!"

For the above input, last_hidden_state and pooler_output will be as follows:

- last_hidden_state contains:
    - [CLS] → Vector 1
    - I → Vector 2
    - love → Vector 3
    - GenAI → Vector 4
    - ! → Vector 5
    - [SEP] → Vector 6
- pooler_output = Tanh(Dense(Vector 1))
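The formula pooler_output = Tanh(Dense(Vector 1)) can be traced by hand on a toy example. The sketch below uses plain Python lists, a hidden size of 3, and made-up weights (an identity matrix and zero bias) chosen only so the arithmetic is easy to follow; `pooler` is a hypothetical helper. In a real BERT model the dense layer is learned and the hidden size is 768.

```python
import math


def pooler(cls_vector, weight, bias):
    """Compute tanh(W @ cls_vector + b) with plain Python lists."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, cls_vector)) + b)
        for row, b in zip(weight, bias)
    ]


# Toy last_hidden_state for one sentence: one vector per token.
last_hidden_state = [
    [0.5, -1.0, 2.0],  # [CLS] -> Vector 1
    [0.1, 0.2, 0.3],   # I     -> Vector 2
    [0.4, 0.5, 0.6],   # love  -> Vector 3
]

# Made-up dense layer: identity weights and zero bias (an assumption for clarity).
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bias = [0.0, 0.0, 0.0]

cls_vector = last_hidden_state[0]  # only the [CLS] vector is used
pooled = pooler(cls_vector, weight, bias)
print(pooled)
```

Because of the Tanh, every component of the pooled vector lies strictly between -1 and 1, and its length equals the hidden size regardless of how many tokens the sentence has.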



### **Essential Model Configuration and Architecture**
- You can access the model's configuration via its .config attribute. The config object contains all the architectural details and hyperparameters.
- **model.config.model_type:** This defines the type of model architecture used, such as: "bert", "roberta", "gpt2", "t5", "distilbert", "bloom", etc. It helps Hugging Face determine:
    - What tokenizer class to use
    - Which model architecture to load
    - How to handle special tokens like [CLS], [SEP], etc.
- **model.config.vocab_size:** The number of unique tokens (words, subwords, or characters) the model knows. For example:
    - BERT: 30522
    - GPT-2: 50257
    - RoBERTa: 50265
    - T5: 32128
- **model.config.num_attention_heads:** Number of self-attention heads in each Transformer layer. For example:
    - BERT-base: 12 heads
    - BERT-large: 16 heads
    - GPT-2: 12, 16, 20, or 25 depending on size
- **model.config.num_hidden_layers:** The number of Transformer encoder (or decoder) layers in the model. Each layer contains multi-head self-attention, a feed-forward neural network, and layer norm with residual connections. For example:
    - BERT-base: 12 layers
    - BERT-large: 24 layers
    - GPT-2 medium: 24 layers
- **model.config.hidden_size:** The size of each hidden layer's output vector and the embedding dimension. It controls the size of the token embeddings, the size of the [CLS] vector, and the input/output shape of the attention blocks. A larger hidden_size gives the model more learning capacity, at the cost of slower inference and training. For example:
    - BERT-base: 768
    - BERT-large: 1024
    - GPT-2: 768, 1024, 1280, or 1600 depending on size
- **model.config.max_position_embeddings:** This defines the maximum input sequence length the model can handle. Each token in the input gets a positional embedding based on its position (1st token, 2nd token, etc.). If your sequence is longer than this, it must be truncated or handled specially, for example with chunking or a sliding window. Typical values:
    - BERT: 512
    - RoBERTa: 514
    - GPT-2: 1024
- **model.training:** This is a PyTorch flag that indicates whether the model is in training mode (True) or evaluation mode (False).
    - Training mode: Dropout is enabled
    - Evaluation mode: Dropout is disabled
```python
model.train()  # enables training mode
model.eval()   # sets evaluation mode
```
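The chunking / sliding-window handling mentioned under max_position_embeddings above can be sketched on a plain list of token IDs. `chunk_ids` is a hypothetical helper written for illustration; it ignores special tokens, so in practice you would leave room for [CLS] and [SEP] in each window.

```python
def chunk_ids(ids, max_len, stride):
    """Split ids into windows of at most max_len tokens that overlap by `stride`."""
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride  # how far each window advances
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(ids[start:start + max_len])
        if start + max_len >= len(ids):
            break  # this window already reaches the end of the sequence
    return chunks


ids = list(range(10))  # stand-in for a sequence of 10 token IDs
chunks = chunk_ids(ids, max_len=4, stride=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The overlap keeps some shared context between neighboring windows. The Hugging Face tokenizer can produce overlapping windows itself via its `return_overflowing_tokens=True` and `stride` arguments, which is usually preferable to a hand-rolled version.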

In [36]:
print("\n--- Model Configuration ---")
print(f"  Model Type: {model.config.model_type}")
print(f"  Vocabulary Size: {model.config.vocab_size}")
print(f"  Number of Attention Heads: {model.config.num_attention_heads}")
print(f"  Number of Layers: {model.config.num_hidden_layers}")
print(f"  Hidden Size (Embedding Dimension): {model.config.hidden_size}")
print(f"  Max Position Embeddings (max sequence length it can handle): {model.config.max_position_embeddings}")


--- Model Configuration ---
  Model Type: bert
  Vocabulary Size: 30522
  Number of Attention Heads: 12
  Number of Layers: 12
  Hidden Size (Embedding Dimension): 768
  Max Position Embeddings (max sequence length it can handle): 512


In [45]:
# model.training reports the current mode: True = training, False = evaluation.
# from_pretrained() returns the model in evaluation mode, so this prints False.
print(f"Model is in training mode (default): {model.training}")

# It's still good practice to set evaluation mode explicitly before inference
model.eval()
print(f"Model is in evaluation mode: {model.training}")

Model is in training mode (default): False
Model is in evaluation mode: False


In [44]:
print("----Model Architecture----")
print(model)

----Model Architecture----
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): D

## **Core AutoModel vs. Task-Specific AutoModelFor...**

This is a critical distinction! The transformers library provides various AutoModelFor... classes, each tailored for a specific downstream task. They all share the same pre-trained backbone but differ in the "head" (the final layers) added on top.

| Class                                | Use Case                         | Head / Architecture               |
| ------------------------------------ | -------------------------------- | --------------------------------- |
| `AutoModel`                          | Raw model for feature extraction | BERT, RoBERTa, GPT, etc.          |
| `AutoModelForSequenceClassification` | Text classification              | Adds a classification head        |
| `AutoModelForTokenClassification`    | NER, POS tagging                 | Token-wise classification         |
| `AutoModelForQuestionAnswering`      | QnA tasks like SQuAD             | Outputs start/end logits          |
| `AutoModelForCausalLM`               | Text generation (GPT)            | Decoder-only LM                   |
| `AutoModelForMaskedLM`               | Fill-in-the-blank (BERT-style)   | Masked token prediction           |
| `AutoModelForSeq2SeqLM`              | Translation, Summarization       | Encoder-decoder models (T5, BART) |

## **AutoModelFor`*`, TFAutoModelFor`*` and FlaxAutoModelFor`*`**

We will briefly show how to use these classes, following this pattern:

* Given input articles.
* Tokenize them (converting to token indices).
* Apply the model on the tokenized data to generate summaries (represented as token indices).
* Decode the summaries into human-readable text.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import pandas as pd


# Load the pre-trained tokenizer.
tokenizer = AutoTokenizer.from_pretrained("t5-small")

# Load the pre-trained model.
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

In [None]:
# For summarization, T5-small expects the prefix "summarize: ",
# so we prepend it to each article as a prompt.
# xsum_sample is assumed to be loaded earlier and to expose a "document" column.

articles = ["summarize: " + article for article in xsum_sample["document"]]

pd.DataFrame(articles, columns=["prompts"])

In [None]:
# Tokenize the input

inputs = tokenizer(
    articles, return_tensors="pt", padding=True, truncation=True, max_length=1024
)

print("input_ids:")
print(inputs["input_ids"])
print("attention_mask:")
print(inputs["attention_mask"])

In [None]:
# Generate summaries

summary_ids = model.generate(
                inputs.input_ids,
                attention_mask=inputs.attention_mask,
                num_beams=2,
                min_length=0,
                max_length=40,
)

print(summary_ids)

In [None]:
# Decode the generated summaries

decoded_summaries = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)

pd.DataFrame(decoded_summaries, columns=["decoded_summaries"])

## **Fine-Tuning**

https://huggingface.co/docs/transformers/training#train-a-tensorflow-model-with-keras