# 🔧 Transformers Basics - SOLUTIONS

**Module 01 | Notebook 1 of 2**

> ⚠️ **Note**: This notebook contains solutions to the student challenges. Try to complete the challenges on your own first using `01_transformers_basics.ipynb`!

In this notebook, you'll learn the fundamental building blocks of working with transformer models using the Hugging Face Transformers library.

## Learning Objectives

By the end of this notebook, you will be able to:
1. Load pre-trained models using `AutoModel` and `AutoTokenizer`
2. Understand the tokenization process
3. Perform model inference
4. Use pipelines for common tasks

---

## Setup

In [1]:
!pip install -q transformers datasets torch accelerate
print("✅ Dependencies installed!")

In [2]:
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
from transformers import AutoModelForSeq2SeqLM, AutoModelForCausalLM
import warnings
warnings.filterwarnings('ignore')

# Check device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


---

## Understanding Tokenization

### What is Tokenization?

Tokenization is the process of converting text into smaller units (tokens) that the model can process. These tokens are then converted to numerical IDs.

```
Text: "Hello, how are you?"
  ↓ Tokenization
Tokens: ["hello", ",", "how", "are", "you", "?"]
  ↓ Convert to IDs
Token IDs: [7592, 1010, 2129, 2024, 2017, 1029]
```

### Why Tokenization Matters

- Models can only process numbers, not text
- Different models use different tokenization strategies
- Token count affects memory usage and processing time
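
As a rough illustration of the two steps in the diagram above, here is a toy sketch in plain Python. It is not how the BERT tokenizer is implemented: the regular expression, the helper names, and the six-entry `toy_vocab` are assumptions made for this example, with IDs copied from the diagram and 100 used as the `[UNK]` ID, as in `bert-base-uncased`.

```python
import re

# Toy vocabulary (hypothetical subset); IDs copied from the diagram above.
toy_vocab = {"hello": 7592, ",": 1010, "how": 2129, "are": 2024, "you": 2017, "?": 1029}
UNK_ID = 100  # ID of [UNK] in bert-base-uncased

def toy_tokenize(text):
    """Lowercase, then split into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def toy_convert_tokens_to_ids(tokens):
    """Look up each token, falling back to the unknown-token ID."""
    return [toy_vocab.get(tok, UNK_ID) for tok in tokens]

toy_tokens = toy_tokenize("Hello, how are you?")
toy_ids = toy_convert_tokens_to_ids(toy_tokens)
print(toy_tokens)
print(toy_ids)
```

A real tokenizer also splits rare words into subword pieces, which this sketch skips.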

In [3]:
# Load a tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Simple tokenization example
text = "Hello, how are you doing today?"

# Tokenize
tokens = tokenizer.tokenize(text)
print(f"Original text: {text}")
print(f"Tokens: {tokens}")
print(f"Number of tokens: {len(tokens)}")

Original text: Hello, how are you doing today?
Tokens: ['hello', ',', 'how', 'are', 'you', 'doing', 'today', '?']
Number of tokens: 8


In [4]:
# Convert tokens to IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"Token IDs: {token_ids}")

# We can also go back from IDs to tokens
decoded_tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(f"Decoded tokens: {decoded_tokens}")

Token IDs: [7592, 1010, 2129, 2024, 2017, 2725, 2651, 1029]
Decoded tokens: ['hello', ',', 'how', 'are', 'you', 'doing', 'today', '?']


### The Complete Tokenization Pipeline

In practice, call the tokenizer object directly. Its `__call__` method is the recommended way to tokenize text because it runs the whole pipeline in one step. Calling `tokenizer(text, return_tensors="pt")` converts the text to tokens, maps the tokens to IDs, adds special tokens (such as `[CLS]` and `[SEP]`), builds the attention mask (and, for BERT, `token_type_ids`), and returns the result as PyTorch tensors.
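
The steps listed above can be sketched in plain Python. This is a simplified illustration, not the library's code: the helper name `toy_encode` is hypothetical, the token IDs are the ones printed by the `convert_tokens_to_ids` cell earlier, and the conversion to tensors is left out.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] IDs in bert-base-uncased

def toy_encode(token_ids):
    """Wrap token IDs in [CLS] ... [SEP] and build the matching masks."""
    input_ids = [CLS_ID] + list(token_ids) + [SEP_ID]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # single segment
        "attention_mask": [1] * len(input_ids),  # every position is a real token
    }

toy_encoded = toy_encode([7592, 1010, 2129, 2024, 2017, 2725, 2651, 1029])
print(toy_encoded["input_ids"])
print(toy_encoded["attention_mask"])
```

Eight token IDs go in and ten come out, because the two special tokens are added around them.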

In [5]:
# Complete tokenization with the __call__ method
encoded = tokenizer(text, return_tensors="pt")

print("Encoded outputs:")
print(f"  Keys: {list(encoded.keys())}")
print(f"  input_ids shape: {encoded['input_ids'].shape}")
print(f"  input_ids: {encoded['input_ids']}")
print(f"  attention_mask: {encoded['attention_mask']}")

Encoded outputs:
  Keys: ['input_ids', 'token_type_ids', 'attention_mask']
  input_ids shape: torch.Size([1, 10])
  input_ids: tensor([[ 101, 7592, 1010, 2129, 2024, 2017, 2725, 2651, 1029,  102]])
  attention_mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])


### Understanding the Outputs

| Field | Description |
|-------|-------------|
| `input_ids` | Token IDs for the model |
| `attention_mask` | 1s for real tokens, 0s for padding |
| `token_type_ids` | Segment IDs (for sentence pairs) |
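
To make `token_type_ids` concrete, the sketch below lays out a sentence pair by hand. It assumes the BERT convention of segment 0 for `[CLS]`, the first sentence, and its `[SEP]`, and segment 1 for the second sentence and the final `[SEP]`. The helper `toy_encode_pair` is hypothetical, and the IDs are reused from the earlier output.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] IDs in bert-base-uncased

def toy_encode_pair(ids_a, ids_b):
    """Lay out [CLS] A [SEP] B [SEP] with BERT-style segment IDs."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

# "how are you" as the first segment, "hello" as the second
toy_pair = toy_encode_pair([2129, 2024, 2017], [7592])
print(toy_pair["input_ids"])
print(toy_pair["token_type_ids"])
```

With the real BERT tokenizer, passing two strings, as in `tokenizer(sentence_a, sentence_b)`, produces this layout.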

In [6]:
# Visualize the tokenization
print("Token-by-token breakdown:")
print("-" * 40)
for token_id, attention in zip(encoded['input_ids'][0], encoded['attention_mask'][0]):
    token = tokenizer.convert_ids_to_tokens([token_id.item()])[0]
    print(f"ID: {token_id.item():5d} | Attention: {attention.item()} | Token: {token}")

Token-by-token breakdown:
----------------------------------------
ID:   101 | Attention: 1 | Token: [CLS]
ID:  7592 | Attention: 1 | Token: hello
ID:  1010 | Attention: 1 | Token: ,
ID:  2129 | Attention: 1 | Token: how
ID:  2024 | Attention: 1 | Token: are
ID:  2017 | Attention: 1 | Token: you
ID:  2725 | Attention: 1 | Token: doing
ID:  2651 | Attention: 1 | Token: today
ID:  1029 | Attention: 1 | Token: ?
ID:   102 | Attention: 1 | Token: [SEP]


### Special Tokens

Models use special tokens to mark the beginning/end of sequences:

* __[CLS] (Classification Token)__ is automatically placed at the beginning of every input sequence and serves as an aggregate representation of the entire sequence. During training, the model learns to encode information from all tokens into this first position through bidirectional attention, making it ideal for sequence-level classification tasks where a single vector representation is needed.

* __[SEP] (Separator Token)__ marks boundaries between different segments in the input. For single sentences, it appears at the end (`[CLS] sentence [SEP]`), while for sentence pairs it separates both sequences (`[CLS] sentence1 [SEP] sentence2 [SEP]`). This is essential for tasks like question answering or sentence similarity where the model needs to distinguish between multiple text segments.

* __[PAD] (Padding Token)__ is used when batching sequences of different lengths to ensure uniform dimensions. Since transformer models process batches efficiently, shorter sequences are padded to match the longest sequence in the batch, and the attention mask ensures these padding tokens are ignored during computation.

* __[UNK] (Unknown Token)__ represents any word or subword that doesn't exist in the model's vocabulary. When the tokenizer encounters an out-of-vocabulary term, it substitutes this token, though modern subword tokenization methods like WordPiece minimize this occurrence.
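
To see why subword tokenization rarely needs `[UNK]`, here is a minimal sketch of WordPiece-style greedy longest-match-first splitting. The function `toy_wordpiece` and its five-entry vocabulary are hypothetical. The real tokenizer also normalizes text and splits on whitespace and punctuation first.

```python
def toy_wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces

toy_subword_vocab = {"token", "##s", "##ization", "un", "##related"}
print(toy_wordpiece("tokens", toy_subword_vocab))
print(toy_wordpiece("tokenization", toy_subword_vocab))
print(toy_wordpiece("xyz", toy_subword_vocab))
```

Only the last word falls back to `[UNK]`, because none of its prefixes is in the toy vocabulary.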

In [7]:
print("Special tokens:")
print(f"  [CLS] token: {tokenizer.cls_token} (ID: {tokenizer.cls_token_id})")
print(f"  [SEP] token: {tokenizer.sep_token} (ID: {tokenizer.sep_token_id})")
print(f"  [PAD] token: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")
print(f"  [UNK] token: {tokenizer.unk_token} (ID: {tokenizer.unk_token_id})")

Special tokens:
  [CLS] token: [CLS] (ID: 101)
  [SEP] token: [SEP] (ID: 102)
  [PAD] token: [PAD] (ID: 0)
  [UNK] token: [UNK] (ID: 100)


---

## Loading Pre-trained Models

### The Auto Classes

Hugging Face provides `Auto` classes that detect the correct model architecture for you. When you call `AutoModel.from_pretrained(model_name)`, the Auto class reads the `model_type` field in the checkpoint's configuration file and selects the matching model class (for example BERT, GPT-2, or RoBERTa). This keeps your code model-agnostic: you can often switch architectures by changing only the model name.

The most common `Auto` classes are:

| Class | Use Case |
|-------|----------|
| `AutoModel` | Base model, no task head (outputs hidden states) |
| `AutoModelForSequenceClassification` | Text classification |
| `AutoModelForSeq2SeqLM` | Sequence-to-sequence (summarization, translation) |
| `AutoModelForCausalLM` | Text generation (GPT-style) |
| `AutoModelForQuestionAnswering` | Extractive QA |


* `AutoModel`: Returns the base transformer model without any task-specific head, outputting raw hidden states and embeddings useful for feature extraction or custom downstream tasks

* `AutoModelForSequenceClassification`: Adds a classification head on top of the base model for tasks like sentiment analysis, spam detection, or multi-class categorization

* `AutoModelForSeq2SeqLM`: Designed for encoder-decoder architectures like T5 or BART, handling sequence-to-sequence tasks such as translation, summarization, or paraphrasing

* `AutoModelForCausalLM`: Used for autoregressive language models like GPT that generate text by predicting the next token, ideal for text completion and generation

* `AutoModelForQuestionAnswering`: Specialized for extractive question answering tasks where the model identifies answer spans within a given context passage
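
The dispatch idea behind the Auto classes can be sketched as a dictionary lookup on `model_type`. Everything below is a hypothetical toy (`ToyBertModel`, `ToyGPT2Model`, `ToyAutoModel`, and the plain-dict config). It shows the pattern only, not the library's actual implementation.

```python
class ToyBertModel:
    def __init__(self, config):
        self.config = config

class ToyGPT2Model:
    def __init__(self, config):
        self.config = config

# Hypothetical registry: model_type string -> concrete class
TOY_MODEL_MAPPING = {"bert": ToyBertModel, "gpt2": ToyGPT2Model}

class ToyAutoModel:
    @classmethod
    def from_config(cls, config):
        """Pick the concrete class named by config['model_type']."""
        model_type = config["model_type"]
        if model_type not in TOY_MODEL_MAPPING:
            raise ValueError(f"Unrecognized model_type: {model_type!r}")
        return TOY_MODEL_MAPPING[model_type](config)

toy_model = ToyAutoModel.from_config({"model_type": "bert", "hidden_size": 768})
print(type(toy_model).__name__)
```

Switching to another architecture only requires a different `model_type` value; the calling code stays the same.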



In [8]:
# Load the base BERT model
model = AutoModel.from_pretrained(model_name)

print(f"Model type: {type(model).__name__}")
print(f"Number of parameters: {sum(p.numel() for p in model.parameters()):,}")

Model type: BertModel
Number of parameters: 109,482,240


In [9]:
# Get model configuration
config = model.config

print("Model configuration:")
print(f"  Hidden size: {config.hidden_size}")
print(f"  Number of layers: {config.num_hidden_layers}")
print(f"  Number of attention heads: {config.num_attention_heads}")
print(f"  Vocabulary size: {config.vocab_size}")
print(f"  Max position embeddings: {config.max_position_embeddings}")

Model configuration:
  Hidden size: 768
  Number of layers: 12
  Number of attention heads: 12
  Vocabulary size: 30522
  Max position embeddings: 512
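
The configuration values above are enough to account for the parameter count printed earlier (109,482,240). The sketch below adds up the weights and biases of a BERT-base encoder by hand. Two values it needs are not printed above and are assumptions here: `intermediate_size = 3072` and `type_vocab_size = 2`, the standard BERT-base settings.

```python
hidden = 768         # config.hidden_size
layers = 12          # config.num_hidden_layers
vocab = 30522        # config.vocab_size
max_pos = 512        # config.max_position_embeddings
intermediate = 3072  # assumed: config.intermediate_size (not printed above)
type_vocab = 2       # assumed: config.type_vocab_size (not printed above)

layer_norm = 2 * hidden  # LayerNorm weight + bias

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

# Word, position, and segment embeddings, followed by a LayerNorm
embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
# Query, key, value, and output projections, followed by a LayerNorm
attention = 4 * linear(hidden, hidden) + layer_norm
# Two dense layers (expand, then project back), followed by a LayerNorm
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
# Dense layer applied to the [CLS] position (source of pooler_output)
pooler = linear(hidden, hidden)

total_params = embeddings + layers * (attention + feed_forward) + pooler
print(f"{total_params:,}")
```

About 85 million of the parameters sit in the 12 encoder layers, and roughly 23 million in the word embedding table.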


### Model Inference

Let's pass our tokenized text through the model:

In [10]:
# Move model to device
model = model.to(device)

# Prepare inputs
inputs = tokenizer(text, return_tensors="pt").to(device)

# Run inference (no gradient computation needed)
with torch.no_grad():
    outputs = model(**inputs)

print(f"Output keys: {list(outputs.keys())}")
print(f"Last hidden state shape: {outputs.last_hidden_state.shape}")
print(f"  - Batch size: {outputs.last_hidden_state.shape[0]}")
print(f"  - Sequence length: {outputs.last_hidden_state.shape[1]}")
print(f"  - Hidden size: {outputs.last_hidden_state.shape[2]}")

Output keys: ['last_hidden_state', 'pooler_output']
Last hidden state shape: torch.Size([1, 10, 768])
  - Batch size: 1
  - Sequence length: 10
  - Hidden size: 768


### Understanding the Output

The `last_hidden_state` is the output of the final transformer layer and holds a contextualized embedding for every token in the input sequence. Unlike static word embeddings, these representations are context-aware: the same word gets a different embedding depending on the words around it. Its shape is `[batch_size, sequence_length, hidden_size]`, here `[1, 10, 768]`, where 768 is BERT-base's hidden dimension:

```
Shape: [batch_size, sequence_length, hidden_size]
       [1,          10,              768]
```

Each token is now represented as a 768-dimensional vector that captures its meaning in context.

In [11]:
# Extract the [CLS] token embedding (often used for classification)
cls_embedding = outputs.last_hidden_state[0, 0, :]  # First sequence in the batch, first token ([CLS])
print(f"CLS embedding shape: {cls_embedding.shape}")
print(f"CLS embedding (first 10 values): {cls_embedding[:10]}")

CLS embedding shape: torch.Size([768])
CLS embedding (first 10 values): tensor([ 0.0051, -0.0445, -0.2543, -0.1362, -0.0878, -0.4347,  0.5267,  0.4450,
         0.1334, -0.1693], device='cuda:0')


---

## Task-Specific Models

For specific tasks, use the appropriate model class:

In [12]:
# Load a classification model
classifier = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
).to(device)

classifier_tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

print(f"Number of labels: {classifier.config.num_labels}")
print(f"Label mapping: {classifier.config.id2label}")

Number of labels: 2
Label mapping: {0: 'NEGATIVE', 1: 'POSITIVE'}


In [13]:
# Run classification
test_texts = [
    "I absolutely love this product!",
    "This is the worst experience ever.",
    "It's okay, nothing special."
]

for text in test_texts:
    inputs = classifier_tokenizer(text, return_tensors="pt").to(device)
    
    with torch.no_grad():
        outputs = classifier(**inputs)
    
    # Get probabilities
    probs = torch.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(probs, dim=-1).item()
    confidence = probs[0, predicted_class].item()
    
    print(f"Text: {text}")
    print(f"  → {classifier.config.id2label[predicted_class]} ({confidence:.2%})")
    print()

Text: I absolutely love this product!
  → POSITIVE (99.99%)

Text: This is the worst experience ever.
  → NEGATIVE (99.98%)

Text: It's okay, nothing special.
  → NEGATIVE (81.90%)
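
The post-processing in the cell above (softmax, argmax, label lookup) can be reproduced without a model. In this sketch the two logits are hypothetical values chosen for illustration, not outputs of the classifier, and `toy_softmax` is a plain-Python stand-in for `torch.softmax`.

```python
import math

def toy_softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

toy_id2label = {0: "NEGATIVE", 1: "POSITIVE"}
toy_logits = [-4.0, 4.5]  # hypothetical raw scores for one input

toy_probs = toy_softmax(toy_logits)
toy_pred = max(range(len(toy_probs)), key=toy_probs.__getitem__)  # argmax
print(f"{toy_id2label[toy_pred]} ({toy_probs[toy_pred]:.2%})")
```

The confidence depends only on the gap between the logits, so a two-class model has to assign even a lukewarm sentence to one of its two labels.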



---

## Using Pipelines

For quick prototyping, use the `pipeline` API, which abstracts away tokenization and post-processing.

The pipeline handles three steps automatically: preprocessing (tokenization), model inference (the forward pass), and postprocessing (converting model outputs to human-readable results). You pass raw text, images, or audio as input and receive task-specific outputs. For example, `pipeline("sentiment-analysis")` selects a default model, loads the matching tokenizer, and returns sentiment labels with confidence scores.

__Using Pipeline__

You can instantiate a pipeline in two ways:
* __By task__: `pipeline(task="text-generation")` uses the default model for that task
* __By model__: `pipeline(model="bert-base-uncased")` infers the task from the model's metadata on the Hugging Face Hub
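
The three stages a pipeline chains together can be sketched with plain functions. This is a toy under loud assumptions: `toy_forward` is a made-up keyword counter standing in for a real model, and none of these function names exist in the Transformers library.

```python
import math

TOY_LABELS = ["NEGATIVE", "POSITIVE"]

def toy_preprocess(text):
    """Preprocessing: lowercase and split into words, dropping punctuation."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return cleaned.split()

def toy_forward(tokens):
    """Stand-in for model inference: count hand-picked sentiment words."""
    positive = {"love", "great", "good"}
    negative = {"worst", "bad", "terrible"}
    pos = sum(tok in positive for tok in tokens)
    neg = sum(tok in negative for tok in tokens)
    return [float(neg), float(pos)]  # scores for NEGATIVE, POSITIVE

def toy_postprocess(scores):
    """Postprocessing: softmax, then report the best label and its score."""
    exps = [math.exp(s) for s in scores]
    probs = [e / sum(exps) for e in exps]
    best = probs.index(max(probs))
    return {"label": TOY_LABELS[best], "score": probs[best]}

def toy_pipeline(texts):
    return [toy_postprocess(toy_forward(toy_preprocess(t))) for t in texts]

toy_results = toy_pipeline([
    "I absolutely love this product!",
    "This is the worst experience ever.",
])
for result in toy_results:
    print(result["label"], f"{result['score']:.2%}")
```

The result format (a list of dictionaries with `label` and `score`) mirrors what the sentiment pipeline returns.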

In [14]:
from transformers import pipeline

# Sentiment analysis pipeline
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    device=0 if torch.cuda.is_available() else -1
)

results = sentiment_pipeline(test_texts)

print("Pipeline Results:")
for text, result in zip(test_texts, results):
    print(f"{text}")
    print(f"  → {result['label']} ({result['score']:.2%})\n")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


Pipeline Results:
I absolutely love this product!
  → POSITIVE (99.99%)

This is the worst experience ever.
  → NEGATIVE (99.98%)

It's okay, nothing special.
  → NEGATIVE (81.90%)



In [15]:
# Summarization pipeline
summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    device=0 if torch.cuda.is_available() else -1
)

long_text = """
The Amazon rainforest, also known as Amazonia, is a moist broadleaf tropical rainforest 
in the Amazon biome that covers most of the Amazon basin of South America. This basin 
encompasses 7,000,000 km2 (2,700,000 sq mi), of which 5,500,000 km2 (2,100,000 sq mi) 
are covered by the rainforest. This region includes territory belonging to nine nations 
and 3,344 formally acknowledged indigenous territories. The majority of the forest is 
contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia 
with 10%, and with minor amounts in Bolivia, Ecuador, French Guiana, Guyana, Suriname, 
and Venezuela.
"""

summary = summarizer(long_text, max_length=50, min_length=20, do_sample=False)
print(f"Original length: {len(long_text.split())} words")
print(f"Summary: {summary[0]['summary_text']}")

Device set to use cuda:0


Original length: 96 words
Summary: The Amazon rainforest, also known as Amazonia, is a moist broadleaf tropical rainforest that covers most of the Amazon basin of South America. This region includes territory belonging to nine nations and 3,344 formally acknowledged indigenous territories.


### Available Pipelines

| Pipeline | Task |
|----------|------|
| `text-classification` | Sentiment, topic classification (alias: `sentiment-analysis`) |
| `token-classification` | NER, POS tagging |
| `question-answering` | Extractive QA |
| `summarization` | Text summarization |
| `translation` | Machine translation |
| `text-generation` | GPT-style generation |
| `fill-mask` | Masked language modeling |

---

## Batch Processing

For efficiency, process multiple inputs at once:

In [16]:
# Batch tokenization with padding
texts = [
    "Short text.",
    "This is a medium length sentence.",
    "This is a much longer sentence that contains many more words and tokens."
]

# Tokenize with padding
batch_encoded = tokenizer(
    texts,
    padding=True,           # Pad to longest in batch
    truncation=True,        # Truncate if too long
    max_length=32,          # Maximum length
    return_tensors="pt"     # Return PyTorch tensors
)

print(f"Batch shape: {batch_encoded['input_ids'].shape}")
print(f"\nInput IDs:")
print(batch_encoded['input_ids'])
print(f"\nAttention Mask (0 = padding):")
print(batch_encoded['attention_mask'])

Batch shape: torch.Size([3, 17])

Input IDs:
tensor([[  101,  2460,  3793,  1012,   102,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0],
        [  101,  2023,  2003,  1037,  5396,  3091,  6251,  1012,   102,     0,
             0,     0,     0,     0,     0,     0,     0],
        [  101,  2023,  2003,  1037,  2172,  2936,  6251,  2008,  3397,  2116,
          2062,  2616,  1998, 19204,  2015,  1012,   102]])

Attention Mask (0 = padding):
tensor([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
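
What `padding=True` does to the batch can be sketched in plain Python. The helper `toy_pad_batch` is hypothetical. The ID lists are the unpadded sequences from the output above, and 0 is the `[PAD]` ID in `bert-base-uncased`.

```python
PAD_ID = 0  # [PAD] ID in bert-base-uncased

def toy_pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_batch = toy_pad_batch([
    [101, 2460, 3793, 1012, 102],
    [101, 2023, 2003, 1037, 5396, 3091, 6251, 1012, 102],
    [101, 2023, 2003, 1037, 2172, 2936, 6251, 2008, 3397, 2116,
     2062, 2616, 1998, 19204, 2015, 1012, 102],
])
for ids, mask in zip(toy_batch["input_ids"], toy_batch["attention_mask"]):
    print(len(ids), sum(mask))
```

Each row ends up 17 positions long, while the mask sums (5, 9, 17) record how many positions hold real tokens.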


---

## 🎯 Student Challenge

Now it's your turn! Complete the following exercises:

### Challenge 1: Compare Tokenizers
Load tokenizers for `bert-base-uncased` and `gpt2`, then compare how they tokenize the same sentence.

In [None]:
# Solution: Compare Tokenizers

# 1. Load both tokenizers
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# 2. Tokenize the test sentence
test_sentence = "The transformer architecture revolutionized natural language processing."

bert_tokens = bert_tokenizer.tokenize(test_sentence)
gpt2_tokens = gpt2_tokenizer.tokenize(test_sentence)

# 3. Print the tokens and token counts for each
print("=" * 60)
print("BERT Tokenizer (WordPiece)")
print("=" * 60)
print(f"Tokens: {bert_tokens}")
print(f"Token count: {len(bert_tokens)}")
print()

print("=" * 60)
print("GPT-2 Tokenizer (BPE)")
print("=" * 60)
print(f"Tokens: {gpt2_tokens}")
print(f"Token count: {len(gpt2_tokens)}")
print()

# Bonus: Show the difference in tokenization strategies
print("=" * 60)
print("Key Observations")
print("=" * 60)
print("• BERT uses WordPiece tokenization (## prefix for subwords)")
print("• GPT-2 uses Byte-Pair Encoding (Ġ prefix for space before token)")
print(f"• BERT produced {len(bert_tokens)} tokens")
print(f"• GPT-2 produced {len(gpt2_tokens)} tokens")
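
The two marker conventions named in the observations can be illustrated by joining tokens back into text. This is a sketch with hypothetical token lists, not actual tokenizer output. It assumes ASCII text and ignores punctuation spacing.

```python
def join_wordpiece(tokens):
    """WordPiece style: '##' marks a piece that continues the previous word."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)

def join_gpt2_style(tokens):
    """GPT-2 style: 'Ġ' marks a token that was preceded by a space."""
    return "".join(tokens).replace("Ġ", " ")

print(join_wordpiece(["token", "##ization", "matters"]))
print(join_gpt2_style(["Token", "ization", "Ġmatters"]))
```

WordPiece marks the pieces that continue a word, while GPT-2's byte-level BPE marks the tokens that start a new word after a space.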


### Challenge 2: Model Size Comparison
Load `distilbert-base-uncased` and `bert-base-uncased`, then compare their parameter counts.

In [None]:
# Solution: Model Size Comparison

# 1. Load both models
bert_model = AutoModel.from_pretrained("bert-base-uncased")
distilbert_model = AutoModel.from_pretrained("distilbert-base-uncased")

# 2. Count parameters for each
bert_params = sum(p.numel() for p in bert_model.parameters())
distilbert_params = sum(p.numel() for p in distilbert_model.parameters())

# 3. Calculate the size reduction percentage
reduction = (bert_params - distilbert_params) / bert_params * 100

print("=" * 60)
print("Model Size Comparison")
print("=" * 60)
print(f"BERT-base parameters:       {bert_params:>15,}")
print(f"DistilBERT parameters:      {distilbert_params:>15,}")
print(f"Parameter reduction:        {reduction:>14.1f}%")
print()

# Bonus: Compare model configurations
print("=" * 60)
print("Model Architecture Comparison")
print("=" * 60)
print(f"{'Attribute':<25} {'BERT':>12} {'DistilBERT':>12}")
print("-" * 50)
print(f"{'Hidden Size':<25} {bert_model.config.hidden_size:>12} {distilbert_model.config.hidden_size:>12}")
print(f"{'Hidden Layers':<25} {bert_model.config.num_hidden_layers:>12} {distilbert_model.config.num_hidden_layers:>12}")
print(f"{'Attention Heads':<25} {bert_model.config.num_attention_heads:>12} {distilbert_model.config.num_attention_heads:>12}")
print()
print("Key Insight: DistilBERT has half the layers (6 vs 12); its authors report")
print("that it retains about 97% of BERT's performance on the GLUE benchmark.")


---

## Key Takeaways

1. **Tokenization** converts text to numerical IDs that models can process
2. **Auto classes** automatically detect the right model architecture
3. **Task-specific models** add appropriate heads for classification, generation, etc.
4. **Pipelines** provide a high-level API for quick prototyping
5. **Batch processing** with padding improves efficiency

---

## Next Steps

Continue to `02_model_architecture.ipynb` to learn about:
- Encoder vs. Decoder architectures
- Attention mechanism visualization
- Memory and compute requirements