# Module 15: BERT & Encoder Models

**Bidirectional Encoder Representations from Transformers**

---

## 1. Objectives

- ✅ Understand BERT architecture
- ✅ Know pretraining objectives (MLM, NSP)
- ✅ Fine-tune BERT with HuggingFace
- ✅ Know BERT variants (RoBERTa, DistilBERT, ALBERT)

## 2. Prerequisites

- [Module 14: Transformer Architecture](../14_transformer_architecture/14_transformer_architecture.ipynb)

## 3. BERT Architecture

### Key Insight
BERT = **Encoder-only** Transformer, trained **bidirectionally**

```
[CLS]  The  cat  sat  [MASK]  the  mat  [SEP]
  ↓     ↓    ↓    ↓     ↓      ↓    ↓     ↓
┌───────────────────────────────────────────┐
│            Transformer Encoder            │
│             (12 or 24 layers)             │
└───────────────────────────────────────────┘
  ↓     ↓    ↓    ↓     ↓      ↓    ↓     ↓
 cls    h1   h2   h3    h4     h5   h6   sep
```

### Model Sizes

| Model | Layers | Hidden | Heads | Params |
|-------|--------|--------|-------|--------|
| BERT-base | 12 | 768 | 12 | 110M |
| BERT-large | 24 | 1024 | 16 | 340M |
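
The parameter counts in the table can be approximated from the layer shapes alone. The sketch below assumes the `bert-base-uncased` configuration values (vocabulary of 30,522, 512 positions, 2 segment types, feed-forward width of 4x the hidden size) and includes the pooler but no task head. `count_bert_params` is a hypothetical helper written for this illustration, not a library function.

```python
def count_bert_params(layers, hidden, vocab=30522, max_pos=512, type_vocab=2):
    """Count encoder parameters (embeddings + layers + pooler), no task head."""
    ffn = 4 * hidden          # assumed: feed-forward width is 4x the hidden size
    layer_norm = 2 * hidden   # scale + bias

    # Token, position and segment embeddings, followed by one LayerNorm
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Q, K, V and output projections (weights + biases), plus LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (weights + biases), plus LayerNorm
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    # Dense layer applied to the [CLS] hidden state
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


for name, n_layers, n_hidden in [("BERT-base", 12, 768), ("BERT-large", 24, 1024)]:
    print(f"{name}: {count_bert_params(n_layers, n_hidden) / 1e6:.1f}M")
```

Under these assumptions the totals come out at roughly 109.5M and 335.1M, in line with the rounded 110M and 340M figures. The number of attention heads does not appear because splitting the hidden dimension into heads does not change the size of the projection matrices.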

In [None]:
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel, BertForSequenceClassification
from transformers import AutoTokenizer, AutoModel

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

## 4. Pretraining Objectives

### Masked Language Modeling (MLM)
```
Input:  "The cat [MASK] on the mat"
Output: Predict "sat" at [MASK] position
```
- Select 15% of tokens at random as prediction targets
- Of the selected tokens: 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged
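
The masking rule above can be sketched in plain Python. This is a minimal illustration, not the original pretraining code: `mask_tokens` is a hypothetical helper, the token ids are illustrative, and the `-100` label follows the convention of PyTorch's cross-entropy loss, which skips that value by default.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(), mask_prob=0.15, rng=None):
    """Return (corrupted_ids, labels) following the 80/10/10 rule."""
    rng = rng or random.Random()
    corrupted = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)

    for i, tok in enumerate(token_ids):
        # Never select special tokens such as [CLS] or [SEP]
        if tok in special_ids or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged

    return corrupted, labels


# Illustrative ids; 101, 102 and 103 are [CLS], [SEP] and [MASK] in bert-base-uncased
ids = [101, 1996, 4937, 2938, 2006, 1996, 13523, 102]
corrupted, labels = mask_tokens(ids, mask_id=103, vocab_size=30522,
                                special_ids={101, 102}, rng=random.Random(0))
print("corrupted:", corrupted)
print("labels:   ", labels)
```

Positions with a label of `-100` contribute nothing to the loss, so the model is only scored on the selected 15% of tokens.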

### Next Sentence Prediction (NSP)
```
[CLS] Sentence A [SEP] Sentence B [SEP]
→ Predict: Is B the next sentence after A?
```
- Binary classification on the final [CLS] representation
- 50% of pairs use the real next sentence, 50% use a random sentence
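
One way to build such training pairs is sketched below. It assumes the corpus is a list of documents, each a list of sentences, and that negatives are drawn from a different document; `make_nsp_pair` is a hypothetical helper and the boolean label is a choice made for this example.

```python
import random


def make_nsp_pair(documents, doc_idx, sent_idx, rng=None):
    """Return (sentence_a, sentence_b, is_next) for one training example."""
    rng = rng or random.Random()
    doc = documents[doc_idx]
    sentence_a = doc[sent_idx]  # sent_idx must not be the last sentence

    if rng.random() < 0.5:
        # Positive example: B really follows A in the same document
        return sentence_a, doc[sent_idx + 1], True

    # Negative example: B is a random sentence from a different document
    other_idx = rng.choice([i for i in range(len(documents)) if i != doc_idx])
    return sentence_a, rng.choice(documents[other_idx]), False


docs = [
    ["The cat sat on the mat.", "Then it fell asleep."],
    ["BERT is an encoder.", "It was released in 2018."],
]
a, b, is_next = make_nsp_pair(docs, doc_idx=0, sent_idx=0, rng=random.Random(0))
print(f"[CLS] {a} [SEP] {b} [SEP] -> is_next={is_next}")
```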

## 5. Using BERT with HuggingFace

In [None]:
# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize
text = "The cat sat on the mat."
inputs = tokenizer(text, return_tensors='pt')

print("Tokenized:")
print(f"  input_ids: {inputs['input_ids']}")
print(f"  tokens: {tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])}")

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

print(f"\nOutputs:")
print(f"  last_hidden_state: {outputs.last_hidden_state.shape}")
print(f"  pooler_output ([CLS]): {outputs.pooler_output.shape}")

## 6. BERT for Classification

In [None]:
# Method 1: Use built-in classification head
classifier = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', 
    num_labels=2
)

inputs = tokenizer("This movie is great!", return_tensors='pt')
with torch.no_grad():
    outputs = classifier(**inputs)

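# Note: the classification head is randomly initialized, so this prediction
# is arbitrary until the model has been fine-tuned on labeled data.
print(f"Logits: {outputs.logits}")
print(f"Prediction: {'Positive' if outputs.logits.argmax() == 1 else 'Negative'}")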
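
The logits printed by the classifier are unnormalized scores. A minimal sketch of turning them into class probabilities with a softmax, using made-up logit values rather than real model output:

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([-0.3, 1.2])  # made-up logits for [Negative, Positive]
print([round(p, 3) for p in probs])
```

With tensors, `torch.softmax(outputs.logits, dim=-1)` performs the same computation.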

In [None]:
# Method 2: Custom classification head
class BertClassifier(nn.Module):
    def __init__(self, num_classes=2, dropout=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(768, num_classes)  # 768 = BERT hidden size
    
    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_output = outputs.pooler_output  # [CLS] hidden state after the dense + tanh pooler
        return self.fc(self.dropout(cls_output))

# Test
model = BertClassifier(num_classes=3)
inputs = tokenizer("Hello world", return_tensors='pt')
logits = model(inputs['input_ids'], inputs['attention_mask'])
print(f"Custom classifier output: {logits.shape}")

## 7. Fine-tuning Best Practices

In [None]:
# Fine-tuning hyperparameters
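# AdamW now comes from PyTorch; the transformers version was deprecated and later removed
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup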

# Typical settings
learning_rate = 2e-5  # Much smaller than training from scratch!
epochs = 3
batch_size = 16
warmup_steps = 100  # about 10% of total_steps

# Optimizer with weight decay
optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=0.01)

# Learning rate scheduler
total_steps = 1000  # Placeholder; in practice: batches per epoch * epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, 
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps
)

print("Fine-tuning setup complete!")
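
The step counts above are placeholders. The sketch below shows how they might be derived from the dataset size, together with a plain-Python version of the linear warmup and decay shape. Both function names are hypothetical, and the 10% warmup ratio is a common choice assumed here, not a requirement.

```python
import math


def training_steps(num_examples, batch_size, epochs, warmup_ratio=0.1):
    """Return (total_steps, warmup_steps), assuming one optimizer step per batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return total_steps, int(warmup_ratio * total_steps)


def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # ramp up from 0 to 1
    # then decay linearly from 1 back to 0
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


total, warmup = training_steps(num_examples=8000, batch_size=16, epochs=3)
for step in (0, warmup // 2, warmup, total // 2, total):
    lr = 2e-5 * linear_warmup_decay(step, warmup, total)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

The learning rate peaks at the base value when warmup ends and reaches zero at the final step, so an underestimated `total_steps` would leave the last batches training with a learning rate of zero.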

## 8. BERT Variants

| Model | Key Change | Use Case |
|-------|------------|----------|
| **RoBERTa** | No NSP, more data, dynamic masking | Better performance |
| **DistilBERT** | 6 layers, knowledge distillation | 40% smaller, 60% faster |
| **ALBERT** | Cross-layer parameter sharing, factorized embeddings | Far fewer parameters |
| **ELECTRA** | Replaced token detection | More efficient pretraining |
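
To make ALBERT's embedding factorization concrete, the arithmetic below compares a single V x H embedding matrix with a V x E lookup followed by an E x H projection. The sizes (V = 30,000, H = 768, E = 128) are assumed for illustration, bias terms are ignored, and `embedding_params` is a hypothetical helper.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Parameters in the token-embedding block, with optional factorization."""
    if embedding_size is None:
        return vocab_size * hidden_size  # BERT-style: one V x H matrix
    # ALBERT-style: V x E lookup followed by an E x H projection
    return vocab_size * embedding_size + embedding_size * hidden_size


full = embedding_params(30000, 768)
factorized = embedding_params(30000, 768, embedding_size=128)
print(f"V x H:         {full:,}")
print(f"V x E + E x H: {factorized:,}")
print(f"reduction:     {full / factorized:.1f}x")
```

The saving grows with the vocabulary size, because only the small E x H projection depends on the hidden size.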

In [None]:
# Using different models
models_to_try = [
    'bert-base-uncased',
    'roberta-base',
    'distilbert-base-uncased',
]

for model_name in models_to_try:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    params = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"{model_name}: {params:.1f}M parameters")

## 9. 🔥 Real-World Usage

### When to Use BERT

| Task | Approach |
|------|-------------|
| Classification | BERT + [CLS] |
| NER/Tagging | BERT + token outputs |
| QA | BERT + start/end heads |
| Sentence Similarity | Sentence-BERT |
| Production (fast) | DistilBERT |
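
For the QA row, the start and end heads each produce one score per token, and the answer is the span with the highest combined score. A minimal decoding sketch, assuming made-up logits and a hypothetical `best_span` helper (real pipelines also restrict spans to the context segment):

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e], s <= e."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):  # the end may not precede the start
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


tokens = ["[CLS]", "who", "sat", "?", "[SEP]", "the", "cat", "sat", "[SEP]"]
start_logits = [0.1, 0.0, 0.2, 0.0, 0.0, 2.5, 1.0, 0.3, 0.0]  # made-up scores
end_logits = [0.1, 0.0, 0.1, 0.0, 0.0, 0.4, 3.0, 0.5, 0.0]    # made-up scores

s, e, score = best_span(start_logits, end_logits)
print(tokens[s:e + 1], score)
```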

### 2024 Landscape
- BERT still widely used for **classification**
- For generation → use GPT or T5
- For embeddings → consider newer models

## 10. Interview Questions

**Q1: How does BERT differ from GPT?**
<details><summary>Answer</summary>

- **BERT**: Encoder-only, bidirectional, MLM pretraining, good for understanding
- **GPT**: Decoder-only, left-to-right, LM pretraining, good for generation
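
The difference between bidirectional and left-to-right shows up in the self-attention mask. A minimal sketch, where `attention_mask` is a hypothetical helper and padding is ignored:

```python
def attention_mask(seq_len, causal):
    """mask[i][j] == 1 means position i may attend to position j."""
    return [[1 if (not causal or j <= i) else 0 for j in range(seq_len)]
            for i in range(seq_len)]


for name, causal in [("BERT (bidirectional)", False), ("GPT (causal)", True)]:
    print(name)
    for row in attention_mask(4, causal):
        print("  ", row)
```

Every row is all ones for BERT, while GPT's mask is lower-triangular so that a token never sees positions to its right.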
</details>

**Q2: What is the [CLS] token for?**
<details><summary>Answer</summary>

A special token prepended to every input. Its final hidden state serves as the aggregate sequence representation for classification tasks. During pretraining, it is trained through the NSP objective.
</details>

**Q3: Why fine-tune with low learning rate?**
<details><summary>Answer</summary>

The pretrained weights already encode useful representations, and a high learning rate would overwrite them. Typical values are 2e-5 to 5e-5. Warmup also helps avoid instability in the first updates.
</details>

## 11. Summary

- **BERT**: Bidirectional encoder, pretrained on MLM + NSP
- **[CLS] token**: Sequence-level representation
- **Fine-tuning**: Low LR (2e-5), warmup, 2-4 epochs
- **Variants**: RoBERTa (better), DistilBERT (faster)

## 12. References

- [BERT Paper](https://arxiv.org/abs/1810.04805)
- [HuggingFace BERT](https://huggingface.co/docs/transformers/model_doc/bert)
- [The Illustrated BERT](https://jalammar.github.io/illustrated-bert/)

---
**Next:** [Module 16: GPT & Decoder Models](../16_gpt/16_gpt.ipynb)