# 11 - Modern NLP with Transformers

## Learning Objectives

By the end of this notebook, you will:

1. **Understand pretrained language models** - BERT, GPT, and their variants
2. **Use HuggingFace Transformers** - Load and use pretrained models
3. **Fine-tune for downstream tasks** - Classification, NER, question answering
4. **Master tokenization** - BPE, WordPiece, and handling special tokens
5. **Apply efficient fine-tuning techniques** - LoRA, freezing layers, learning rate scheduling

---

## Setup

In [None]:
# Install required packages (run once)
# scikit-learn is needed for the metrics used in Section 3.3
# !pip install transformers datasets tokenizers accelerate scikit-learn

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
import numpy as np
import matplotlib.pyplot as plt
from typing import Dict, List, Optional, Tuple
import warnings
warnings.filterwarnings('ignore')

# HuggingFace imports
from transformers import (
    AutoModel, AutoModelForSequenceClassification, AutoModelForTokenClassification,
    AutoModelForQuestionAnswering, AutoModelForCausalLM,
    AutoTokenizer, AutoConfig,
    TrainingArguments, Trainer,
    get_linear_schedule_with_warmup,
    DataCollatorWithPadding
)
from datasets import load_dataset

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

torch.manual_seed(42)

---

## 1. Understanding Pretrained Language Models

### 1.1 The Pretraining Paradigm

Modern NLP follows a two-stage approach:
1. **Pretraining**: Learn general language understanding from massive unlabeled text
2. **Fine-tuning**: Adapt to specific downstream tasks with labeled data

Key model families:
- **Encoder-only (BERT)**: Bidirectional, good for understanding tasks
- **Decoder-only (GPT)**: Autoregressive, good for generation
- **Encoder-Decoder (T5, BART)**: Good for seq2seq tasks
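
The practical difference between the encoder-only and decoder-only families shows up in the self-attention mask. As a minimal sketch (the helper name `build_attention_mask` is hypothetical, not a library function), a bidirectional mask lets every position attend to every other position, while a causal mask hides future positions:

```python
import torch

def build_attention_mask(seq_len: int, causal: bool) -> torch.Tensor:
    """Return a (seq_len, seq_len) 0/1 mask; entry [i, j] is 1 if position i may attend to j."""
    full = torch.ones(seq_len, seq_len, dtype=torch.long)
    # Lower-triangular mask: position i sees positions 0..i only
    return torch.tril(full) if causal else full

bidirectional = build_attention_mask(4, causal=False)  # BERT-style
causal = build_attention_mask(4, causal=True)          # GPT-style

print("Bidirectional:\n", bidirectional)
print("Causal:\n", causal)
```

Real models apply this mask inside each attention layer; the sketch only shows which token pairs are visible to each other.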

In [None]:
# Compare model architectures

model_info = {
    'bert-base-uncased': {
        'type': 'Encoder-only',
        'params': '110M',
        'pretraining': 'MLM + NSP',
        'use_cases': 'Classification, NER, QA'
    },
    'gpt2': {
        'type': 'Decoder-only', 
        'params': '124M',
        'pretraining': 'Causal LM',
        'use_cases': 'Text generation'
    },
    'roberta-base': {
        'type': 'Encoder-only',
        'params': '125M', 
        'pretraining': 'MLM (no NSP)',
        'use_cases': 'Same as BERT, often better'
    },
    'distilbert-base-uncased': {
        'type': 'Encoder-only',
        'params': '66M',
        'pretraining': 'Distillation from BERT',
        'use_cases': 'Same as BERT, faster'
    }
}

print("Popular Pretrained Models:\n")
for model_name, info in model_info.items():
    print(f"{model_name}:")
    for key, value in info.items():
        print(f"  {key}: {value}")
    print()

### 1.2 BERT: Bidirectional Encoder Representations from Transformers

In [None]:
# Load BERT model and tokenizer

model_name = 'bert-base-uncased'

# Tokenizer converts text to token IDs
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Base model (without task-specific head)
model = AutoModel.from_pretrained(model_name)

print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Model config: {model.config.hidden_size}d, {model.config.num_hidden_layers} layers, {model.config.num_attention_heads} heads")
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")

In [None]:
# Understand BERT's input format

text = "Hello, how are you doing today?"

# Tokenize
tokens = tokenizer.tokenize(text)
print(f"Original text: {text}")
print(f"Tokens: {tokens}")

# Convert to IDs with special tokens
encoded = tokenizer(text, return_tensors='pt')
print(f"\nEncoded:")
print(f"  input_ids: {encoded['input_ids']}")
print(f"  token_type_ids: {encoded['token_type_ids']}")
print(f"  attention_mask: {encoded['attention_mask']}")

# Decode back
decoded = tokenizer.decode(encoded['input_ids'][0])
print(f"\nDecoded: {decoded}")

In [None]:
# BERT special tokens

print("Special tokens:")
print(f"  [CLS] (classification): {tokenizer.cls_token} -> {tokenizer.cls_token_id}")
print(f"  [SEP] (separator): {tokenizer.sep_token} -> {tokenizer.sep_token_id}")
print(f"  [PAD] (padding): {tokenizer.pad_token} -> {tokenizer.pad_token_id}")
print(f"  [MASK] (masking): {tokenizer.mask_token} -> {tokenizer.mask_token_id}")
print(f"  [UNK] (unknown): {tokenizer.unk_token} -> {tokenizer.unk_token_id}")
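
For sentence-pair tasks, BERT joins both segments into one sequence of the form `[CLS] A [SEP] B [SEP]` and marks them with `token_type_ids`. The sketch below reproduces that layout on already-split tokens without loading a tokenizer; `build_pair_input` is a hypothetical helper written for this illustration.

```python
from typing import Dict, List

def build_pair_input(tokens_a: List[str], tokens_b: List[str]) -> Dict[str, List]:
    """Lay out a sentence pair the way BERT expects: [CLS] A [SEP] B [SEP]."""
    tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return {'tokens': tokens, 'token_type_ids': token_type_ids}

pair = build_pair_input(['how', 'are', 'you'], ['fine', 'thanks'])
for token, segment in zip(pair['tokens'], pair['token_type_ids']):
    print(f"{token:>8} -> segment {segment}")
```

With the real tokenizer, passing two strings (`tokenizer(text_a, text_b)`) produces the same structure with token IDs instead of strings.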

In [None]:
# Get BERT embeddings

model.eval()
model.to(device)

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors='pt').to(device)

with torch.no_grad():
    outputs = model(**inputs)

# BERT outputs:
# - last_hidden_state: (batch, seq_len, hidden_size) - token representations
# - pooler_output: (batch, hidden_size) - [CLS] hidden state passed through a linear layer and tanh

print(f"Last hidden state shape: {outputs.last_hidden_state.shape}")
print(f"Pooler output shape: {outputs.pooler_output.shape}")

# [CLS] token representation (often used for classification)
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(f"\n[CLS] embedding shape: {cls_embedding.shape}")

---

## 2. Tokenization Deep Dive

### 2.1 Subword Tokenization

In [None]:
# Different tokenization strategies

# BERT uses WordPiece
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# GPT-2 uses BPE (Byte Pair Encoding)
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')

# RoBERTa also uses BPE
roberta_tokenizer = AutoTokenizer.from_pretrained('roberta-base')

test_text = "I'm learning about transformers and tokenization!"

print(f"Text: {test_text}\n")
print(f"BERT (WordPiece): {bert_tokenizer.tokenize(test_text)}")
print(f"GPT-2 (BPE): {gpt2_tokenizer.tokenize(test_text)}")
print(f"RoBERTa (BPE): {roberta_tokenizer.tokenize(test_text)}")
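
To make the BPE idea concrete, here is a toy sketch of how merges are learned: count adjacent symbol pairs across the corpus, merge the most frequent pair into a new symbol, and repeat. The corpus and the helper names (`most_frequent_pair`, `merge_pair`) are made up for illustration; production tokenizers work on bytes and far larger corpora.

```python
from typing import Dict, List, Tuple

Corpus = Dict[Tuple[str, ...], int]

def most_frequent_pair(corpus: Corpus) -> Tuple[str, str]:
    """Count adjacent symbol pairs, weighted by word frequency, and return the most common."""
    pairs: Dict[Tuple[str, str], int] = {}
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] = pairs.get((left, right), 0) + freq
    # max() returns the first maximal key, so ties resolve by first appearance
    return max(pairs, key=pairs.get)

def merge_pair(corpus: Corpus, pair: Tuple[str, str]) -> Corpus:
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged: Corpus = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word split into characters, mapped to its frequency
corpus: Corpus = {tuple('low'): 5, tuple('lower'): 2, tuple('newest'): 6, tuple('widest'): 3}

merges: List[Tuple[str, str]] = []
for _ in range(3):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print("Learned merges:", merges)
print("Segmented words:", list(corpus))
```

Frequent character sequences such as the suffix `est` become single symbols after a few merges, which is why common words end up as one token while rare words are split into several.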

In [None]:
# How subword tokenization handles unknown words

unusual_text = "The transformerification of NLP is antidisestablishmentarianism."

print(f"Unusual text: {unusual_text}\n")
print(f"BERT tokens: {bert_tokenizer.tokenize(unusual_text)}")

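# Subword tokenization breaks rare or unseen words into known pieces,
# so out-of-vocabulary words rarely map to [UNK].
# WordPiece can still emit [UNK] when a word contains characters missing
# from its vocabulary; byte-level BPE (GPT-2, RoBERTa) can encode any string.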
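
The splitting itself can be sketched as a greedy longest-match-first search, which is the strategy WordPiece uses at inference time. The vocabulary below is invented for the example and `wordpiece_tokenize` is a hypothetical helper, not the library implementation. Note the fallback: when no piece matches, the whole word becomes `[UNK]`.

```python
from typing import List, Optional, Set

def wordpiece_tokenize(word: str, vocab: Set[str], unk_token: str = '[UNK]') -> List[str]:
    """Greedy longest-match-first split of one word into WordPiece-style pieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match: Optional[str] = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, invented for this example
toy_vocab = {'token', '##ization', '##ize', 'un', '##believ', '##able', 'trans', '##former'}

for word in ['tokenization', 'unbelievable', 'transformer', 'tokenizer']:
    print(word, '->', wordpiece_tokenize(word, toy_vocab))
```

A real BERT vocabulary includes single characters and their `##` variants, so the `[UNK]` fallback is rare in practice and mostly triggered by characters that never appeared in the training data.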

In [None]:
# Handling long sequences

long_text = "This is a very long text. " * 100

# Most models have a maximum sequence length
print(f"BERT max length: {bert_tokenizer.model_max_length}")

# Tokenizer can truncate and pad
encoded = bert_tokenizer(
    long_text,
    max_length=128,
    truncation=True,
    padding='max_length',
    return_tensors='pt'
)

print(f"Encoded length: {encoded['input_ids'].shape[1]}")

In [None]:
# Batch tokenization with padding

texts = [
    "Short text.",
    "This is a medium-length piece of text.",
    "And this one is even longer, containing many more words and tokens to process."
]

# Pad to longest in batch
batch_encoded = bert_tokenizer(
    texts,
    padding=True,
    truncation=True,
    return_tensors='pt'
)

print(f"Batch shape: {batch_encoded['input_ids'].shape}")
print(f"\nAttention masks (1=real token, 0=padding):")
for i, mask in enumerate(batch_encoded['attention_mask']):
    print(f"  Text {i}: {mask.sum().item()} real tokens")
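
What `padding=True` does can be reproduced by hand. The following sketch pads lists of token IDs to the batch maximum and builds the matching attention masks; `pad_batch` is a hypothetical helper and the token IDs are made up.

```python
from typing import Dict, List

def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the batch maximum and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Token IDs below are illustrative only
batch = pad_batch([[101, 2460, 102], [101, 2023, 2003, 1037, 102]])
for ids, mask in zip(batch['input_ids'], batch['attention_mask']):
    print(ids, mask)
```

Padding to the longest sequence in each batch, rather than to a fixed `max_length`, avoids spending compute on padding tokens.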

---

## 3. Text Classification with Fine-tuning

### 3.1 Loading a Dataset

In [None]:
# Load IMDb sentiment classification dataset

dataset = load_dataset('imdb')

print(f"Dataset structure:")
print(dataset)

print(f"\nExample:")
print(f"Text: {dataset['train'][0]['text'][:200]}...")
print(f"Label: {dataset['train'][0]['label']} (0=negative, 1=positive)")

In [None]:
# Tokenize the dataset

model_name = 'distilbert-base-uncased'  # Smaller model for faster training
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=256
    )

# Use a small subset for demonstration
small_train = dataset['train'].shuffle(seed=42).select(range(1000))
small_test = dataset['test'].shuffle(seed=42).select(range(200))

tokenized_train = small_train.map(tokenize_function, batched=True)
tokenized_test = small_test.map(tokenize_function, batched=True)

# Set format for PyTorch
tokenized_train.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
tokenized_test.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

print(f"Tokenized train sample keys: {tokenized_train[0].keys()}")

### 3.2 Fine-tuning with PyTorch

In [None]:
# Load model for sequence classification

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)
model.to(device)

print(f"Model architecture:")
print(f"  Base model: DistilBERT")
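print(f"  Classification head: Linear({model.config.hidden_size} -> {model.config.hidden_size}) -> ReLU -> Dropout -> Linear({model.config.hidden_size} -> {model.config.num_labels})")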
print(f"  Total parameters: {sum(p.numel() for p in model.parameters()):,}")

In [None]:
# Manual training loop

train_loader = DataLoader(tokenized_train, batch_size=16, shuffle=True)
test_loader = DataLoader(tokenized_test, batch_size=32)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

# Learning rate schedule with warmup
num_epochs = 3
num_training_steps = len(train_loader) * num_epochs
num_warmup_steps = num_training_steps // 10

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps
)

print(f"Training for {num_epochs} epochs")
print(f"Total steps: {num_training_steps}, warmup steps: {num_warmup_steps}")
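
The shape of the warmup-plus-linear-decay schedule is easy to compute by hand. This sketch writes the learning-rate multiplier as a plain function (`linear_warmup_decay` is a name chosen for this example) and evaluates it with the step counts from the setup above: 1,000 examples at batch size 16 give 63 batches per epoch, so 189 steps in total and 18 warmup steps.

```python
def linear_warmup_decay(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Ramp linearly from 0 up to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Then decay linearly from 1 down to 0 at the final step
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps, warmup_steps = 189, 18

for step in [0, 9, 18, 100, 189]:
    factor = linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:>3}: lr = {base_lr * factor:.2e}")
```

The learning rate peaks at the end of warmup and reaches zero on the last step, which is why the total number of training steps has to be known before the scheduler is created.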

In [None]:
# Training loop

def train_epoch(model, dataloader, optimizer, scheduler):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    
    for batch in dataloader:
        optimizer.zero_grad()
        
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        
        loss = outputs.loss
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
        total_loss += loss.item()
        preds = outputs.logits.argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    
    return total_loss / len(dataloader), correct / total


@torch.no_grad()
def evaluate(model, dataloader):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0
    
    for batch in dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        
        total_loss += outputs.loss.item()
        preds = outputs.logits.argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    
    return total_loss / len(dataloader), correct / total


# Train
for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(model, train_loader, optimizer, scheduler)
    test_loss, test_acc = evaluate(model, test_loader)
    
    print(f"Epoch {epoch+1}/{num_epochs}:")
    print(f"  Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}")
    print(f"  Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.4f}")

In [None]:
# Test on new examples

@torch.no_grad()
def predict_sentiment(texts: List[str]):
    model.eval()
    
    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=256,
        return_tensors='pt'
    ).to(device)
    
    outputs = model(**inputs)
    probs = F.softmax(outputs.logits, dim=-1)
    preds = probs.argmax(dim=-1)
    
    labels = ['Negative', 'Positive']
    
    for text, pred, prob in zip(texts, preds, probs):
        print(f"Text: {text[:50]}...")
        print(f"Prediction: {labels[pred]} (confidence: {prob[pred]:.3f})\n")


test_texts = [
    "This movie was absolutely fantastic! Best film I've seen all year.",
    "What a waste of time. Terrible acting and boring plot.",
    "It was okay, nothing special but not bad either."
]

predict_sentiment(test_texts)

### 3.3 Using HuggingFace Trainer

In [None]:
# The Trainer API makes fine-tuning easier

from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1': f1_score(labels, predictions)
    }


# Reload fresh model
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_ratio=0.1,
    weight_decay=0.01,
    learning_rate=2e-5,
    logging_dir='./logs',
    logging_steps=50,
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics
)

print("Trainer configured! Call trainer.train() to start training.")

---

## 4. Named Entity Recognition (Token Classification)

In [None]:
# NER model: classify each token

ner_model_name = 'dslim/bert-base-NER'
ner_tokenizer = AutoTokenizer.from_pretrained(ner_model_name)
ner_model = AutoModelForTokenClassification.from_pretrained(ner_model_name)
ner_model.to(device)
ner_model.eval()

# NER labels
id2label = ner_model.config.id2label
print(f"NER labels: {id2label}")

In [None]:
# Perform NER

@torch.no_grad()
def extract_entities(text: str) -> List[Dict]:
    """Extract named entities from text"""
    
    inputs = ner_tokenizer(
        text,
        return_tensors='pt',
        return_offsets_mapping=True
    )
    
    offset_mapping = inputs.pop('offset_mapping')[0]
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    outputs = ner_model(**inputs)
    predictions = outputs.logits.argmax(dim=-1)[0]
    
    entities = []
    current_entity = None
    
    for pred, offset in zip(predictions, offset_mapping):
        # Special tokens ([CLS], [SEP]) have an empty (0, 0) offset; skip them
        if offset[0] == offset[1]:
            continue
        label = id2label[pred.item()]
        
        if label == 'O':
            if current_entity:
                entities.append(current_entity)
                current_entity = None
        elif label.startswith('B-'):
            if current_entity:
                entities.append(current_entity)
            current_entity = {
                'type': label[2:],
                'start': offset[0].item(),
                'end': offset[1].item(),
                'text': text[offset[0]:offset[1]]
            }
        elif label.startswith('I-') and current_entity:
            if label[2:] == current_entity['type']:
                current_entity['end'] = offset[1].item()
                current_entity['text'] = text[current_entity['start']:current_entity['end']]
    
    if current_entity:
        entities.append(current_entity)
    
    return entities


# Test
test_text = "Apple CEO Tim Cook announced a new iPhone at the event in San Francisco."
entities = extract_entities(test_text)

print(f"Text: {test_text}\n")
print("Entities found:")
for entity in entities:
    print(f"  {entity['text']} ({entity['type']})")
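
The span-grouping logic is easier to follow on word-level tags without a model in the loop. The sketch below applies the same BIO rules to a hand-written tag sequence; `bio_to_spans` is a hypothetical helper and the tags are written by hand, not model output.

```python
from typing import List, Optional, Tuple

def bio_to_spans(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group word-level BIO tags into (entity_text, entity_type) spans."""
    spans: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_type: Optional[str] = None
    for word, tag in zip(words, tags):
        # An I- tag extends the open entity only if the types agree
        if tag.startswith('I-') and current_type == tag[2:]:
            current_words.append(word)
            continue
        # Any other tag closes the open entity, if there is one
        if current_words:
            spans.append((' '.join(current_words), current_type))
            current_words, current_type = [], None
        if tag.startswith('B-'):
            current_words, current_type = [word], tag[2:]
        # A stray I- tag with no matching open entity is dropped
    if current_words:
        spans.append((' '.join(current_words), current_type))
    return spans

words = ['Tim', 'Cook', 'visited', 'San', 'Francisco', 'with', 'Apple']
tags = ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'O', 'B-ORG']

print(bio_to_spans(words, tags))
```

With subword tokenizers, one word can map to several tokens, which is why the model-based version above tracks character offsets instead of joining words.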

---

## 5. Question Answering

In [None]:
# Extractive QA: find answer span in context

qa_model_name = 'distilbert-base-cased-distilled-squad'
qa_tokenizer = AutoTokenizer.from_pretrained(qa_model_name)
qa_model = AutoModelForQuestionAnswering.from_pretrained(qa_model_name)
qa_model.to(device)
qa_model.eval()

print(f"QA model loaded: {qa_model_name}")

In [None]:
@torch.no_grad()
def answer_question(question: str, context: str) -> Dict:
    """Extract answer from context given a question"""
    
    inputs = qa_tokenizer(
        question,
        context,
        return_tensors='pt',
        max_length=512,
        truncation=True
    ).to(device)
    
    outputs = qa_model(**inputs)
    
    # Get most likely start and end positions
    start_scores = outputs.start_logits[0]
    end_scores = outputs.end_logits[0]
    
    start_idx = start_scores.argmax().item()
    end_idx = end_scores.argmax().item()
    
    # Ensure valid span
    if end_idx < start_idx:
        end_idx = start_idx
    
    # Decode answer
    answer_tokens = inputs['input_ids'][0][start_idx:end_idx+1]
    answer = qa_tokenizer.decode(answer_tokens)
    
    # Heuristic confidence: average of the start and end probabilities (not a calibrated score)
    start_prob = F.softmax(start_scores, dim=-1)[start_idx].item()
    end_prob = F.softmax(end_scores, dim=-1)[end_idx].item()
    confidence = (start_prob + end_prob) / 2
    
    return {
        'answer': answer,
        'confidence': confidence,
        'start': start_idx,
        'end': end_idx
    }


# Test
context = """
PyTorch is an open source machine learning framework based on the Torch library, 
used for applications such as computer vision and natural language processing. 
It was primarily developed by Meta AI (formerly Facebook's AI Research lab). 
PyTorch was released in October 2016 and has become one of the most popular 
deep learning frameworks alongside TensorFlow.
"""

questions = [
    "Who developed PyTorch?",
    "When was PyTorch released?",
    "What is PyTorch used for?"
]

print(f"Context: {context[:100]}...\n")
for q in questions:
    result = answer_question(q, context)
    print(f"Q: {q}")
    print(f"A: {result['answer']} (confidence: {result['confidence']:.3f})\n")
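
Taking the start and end argmax independently can produce an end position before the start. A common alternative, sketched here under the assumption that a span's score is the sum of its start and end logits, searches only over valid spans up to a maximum length. `best_span` and the logits are made up for illustration.

```python
from typing import List, Tuple

def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 15) -> Tuple[int, int, float]:
    """Return (start, end, score) maximising start_logit + end_logit with start <= end."""
    best = (0, 0, float('-inf'))
    for start, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        last = min(len(end_logits), start + max_answer_len)
        for end in range(start, last):
            score = s_logit + end_logits[end]
            if score > best[2]:
                best = (start, end, score)
    return best

# Made-up logits where the independent argmaxes give end < start
start_logits = [0.1, 0.2, 5.0, 0.3, 1.0]
end_logits = [0.0, 4.0, 0.5, 3.5, 0.2]

naive = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))
print("Independent argmax (start, end):", naive)
print("Best valid span:", best_span(start_logits, end_logits))
```

Here the independent argmaxes give start 2 and end 1, which is not a valid span, while the joint search picks the best-scoring span that starts at position 2.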

---

## 6. Text Generation with GPT

### 6.1 Basic Generation

In [None]:
# Load GPT-2

gpt_model_name = 'gpt2'
gpt_tokenizer = AutoTokenizer.from_pretrained(gpt_model_name)
gpt_model = AutoModelForCausalLM.from_pretrained(gpt_model_name)
gpt_model.to(device)
gpt_model.eval()

# GPT-2 doesn't have a pad token by default
gpt_tokenizer.pad_token = gpt_tokenizer.eos_token

print(f"GPT-2 loaded: {sum(p.numel() for p in gpt_model.parameters()):,} parameters")

In [None]:
@torch.no_grad()
def generate_text(
    prompt: str,
    max_new_tokens: int = 50,
    temperature: float = 0.7,
    top_k: int = 50,
    top_p: float = 0.95,
    do_sample: bool = True
) -> str:
    """Generate text continuation"""
    
    inputs = gpt_tokenizer(prompt, return_tensors='pt').to(device)
    
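    generation_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        pad_token_id=gpt_tokenizer.eos_token_id
    )
    if do_sample:
        # Sampling-only parameters; they have no effect under greedy decoding
        generation_kwargs.update(temperature=temperature, top_k=top_k, top_p=top_p)
    
    outputs = gpt_model.generate(**inputs, **generation_kwargs)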
    
    return gpt_tokenizer.decode(outputs[0], skip_special_tokens=True)


# Test generation
prompts = [
    "The future of artificial intelligence is",
    "In a world where machines can think,",
    "Once upon a time, in a land far away,"
]

for prompt in prompts:
    print(f"Prompt: {prompt}")
    generated = generate_text(prompt, max_new_tokens=40)
    print(f"Generated: {generated}")
    print("-" * 50)

In [None]:
# Compare different decoding strategies

prompt = "The key to success in machine learning is"

print(f"Prompt: {prompt}\n")

# Greedy decoding (do_sample=False: always pick the most likely token; temperature is not used)
print("1. Greedy decoding:")
print(generate_text(prompt, do_sample=False, max_new_tokens=30))

# Low temperature (more focused)
print("\n2. Low temperature (0.3):")
print(generate_text(prompt, temperature=0.3, max_new_tokens=30))

# Higher temperature (1.0 leaves the distribution unscaled; more diverse output)
print("\n3. High temperature (1.0):")
print(generate_text(prompt, temperature=1.0, max_new_tokens=30))

# Top-p (nucleus) sampling only; top_k=0 disables top-k filtering
print("\n4. Top-p (nucleus) sampling:")
print(generate_text(prompt, top_p=0.9, top_k=0, max_new_tokens=30))
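
The effect of these parameters can be seen on a single made-up next-token distribution. The sketch below is an illustration of the idea, not the library's implementation: temperature rescales the logits before the softmax, and top-p keeps the smallest set of most likely tokens whose cumulative probability reaches the threshold.

```python
import math
from typing import Dict

def softmax_with_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}  # renormalise before sampling

# Made-up next-token logits
logits = {'data': 3.0, 'practice': 2.0, 'patience': 1.0, 'luck': 0.0, 'bananas': -2.0}

for temperature in [0.3, 1.0]:
    probs = softmax_with_temperature(logits, temperature)
    print(f"T={temperature}:", {tok: round(p, 3) for tok, p in probs.items()})

nucleus = top_p_filter(softmax_with_temperature(logits, 1.0), top_p=0.9)
print("Nucleus (top_p=0.9):", list(nucleus))
```

Lower temperatures concentrate probability on the top token, and the nucleus filter removes the unlikely tail before a token is sampled.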

---

## 7. Efficient Fine-tuning Techniques

### 7.1 Freezing Layers

In [None]:
# Freeze base model, only train classification head

def freeze_base_model(model):
    """Freeze all parameters except the classification head"""
    for name, param in model.named_parameters():
        if 'classifier' not in name and 'pooler' not in name:
            param.requires_grad = False
    
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")


# Load fresh model
model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=2
)

print("Before freezing:")
print(f"  Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

print("\nAfter freezing:")
freeze_base_model(model)

In [None]:
# Gradual unfreezing

def unfreeze_layers(model, num_layers: int):
    """Unfreeze the top N transformer layers"""
    # First freeze everything
    for param in model.parameters():
        param.requires_grad = False
    
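    # Unfreeze the classification head; DistilBERT's head has two linear layers,
    # pre_classifier and classifier, and both are randomly initialized
    for head in (model.pre_classifier, model.classifier):
        for param in head.parameters():
            param.requires_grad = True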
    
    # Unfreeze top N encoder layers
    total_layers = model.config.num_hidden_layers
    for i in range(total_layers - num_layers, total_layers):
        for param in model.distilbert.transformer.layer[i].parameters():
            param.requires_grad = True
    
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Unfroze top {num_layers} layers: {trainable:,} / {total:,} trainable ({100*trainable/total:.2f}%)")


# Example
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)
unfreeze_layers(model, num_layers=2)

### 7.2 LoRA (Low-Rank Adaptation)

In [None]:
# LoRA: Add small trainable matrices to frozen weights

class LoRALayer(nn.Module):
    """
    LoRA adapter layer.
    
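    The pretrained weight W stays frozen. We train only a low-rank update
    A @ B (rank r), scaled by alpha / r, and add its output to the original layer's.
    B starts at zero, so the wrapped layer initially behaves exactly like the original.
    With r much smaller than the layer dimensions, this dramatically reduces trainable parameters.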
    """
    
    def __init__(self, original_layer: nn.Linear, rank: int = 8, alpha: float = 16):
        super().__init__()
        
        self.original_layer = original_layer
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        
        # Low-rank matrices
        self.lora_A = nn.Parameter(torch.randn(in_features, rank) / rank)
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
        
        # Freeze original weights
        self.original_layer.weight.requires_grad = False
        if self.original_layer.bias is not None:
            self.original_layer.bias.requires_grad = False
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original output + LoRA delta
        original_output = self.original_layer(x)
        lora_output = (x @ self.lora_A @ self.lora_B) * self.scaling
        return original_output + lora_output


# Example
original = nn.Linear(768, 768)
lora = LoRALayer(original, rank=8)

original_params = sum(p.numel() for p in original.parameters())
lora_params = lora.lora_A.numel() + lora.lora_B.numel()

print(f"Original layer parameters: {original_params:,}")
print(f"LoRA parameters: {lora_params:,}")
print(f"Reduction: {100 * (1 - lora_params/original_params):.2f}%")
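
One useful property of this formulation is that the adapter can be folded back into the base weight after training, so inference costs nothing extra. The sketch below checks that equivalence numerically with random matrices standing in for trained LoRA weights; the dimensions are arbitrary and chosen for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
in_features, out_features, rank, alpha = 16, 12, 4, 8
scaling = alpha / rank

base = nn.Linear(in_features, out_features)
# Random, non-zero adapter matrices stand in for trained LoRA weights
lora_A = torch.randn(in_features, rank)
lora_B = torch.randn(rank, out_features)

x = torch.randn(3, in_features)

with torch.no_grad():
    # Adapter form: frozen layer output plus the scaled low-rank update
    adapter_out = base(x) + (x @ lora_A @ lora_B) * scaling

    # Merged form: nn.Linear stores weight as (out_features, in_features), hence the transpose
    merged = nn.Linear(in_features, out_features)
    merged.weight.copy_(base.weight + scaling * (lora_A @ lora_B).T)
    merged.bias.copy_(base.bias)
    merged_out = merged(x)

print("Outputs match:", torch.allclose(adapter_out, merged_out, atol=1e-4))
```

Keeping the adapter separate during training is what makes it cheap to store and swap; merging is an optional deployment step.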

In [None]:
# Apply LoRA to attention layers

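def add_lora_to_model(model, rank: int = 8, alpha: float = 16,
                      target_names: Tuple[str, ...] = ('query', 'value', 'q_lin', 'v_lin')):
    """
    Add LoRA adapters to query and value projections.
    
    BERT names these projections query/value; DistilBERT names them q_lin/v_lin.
    """
    # Freeze all pretrained weights so that only the LoRA matrices
    # (and the task head, re-enabled below) are trained
    for param in model.parameters():
        param.requires_grad = False
    
    # Snapshot the module list so the tree is not modified while iterating over it
    for name, module in list(model.named_modules()):
        child_name = name.split('.')[-1]
        if isinstance(module, nn.Linear) and child_name in target_names:
            # Replace with LoRA version
            parent_name = '.'.join(name.split('.')[:-1])
            parent = model.get_submodule(parent_name)
            setattr(parent, child_name, LoRALayer(module, rank, alpha))
    
    # Keep the classification head trainable (also matches DistilBERT's pre_classifier)
    for name, param in model.named_parameters():
        if 'classifier' in name:
            param.requires_grad = True
    
    # Count parameters
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"After LoRA: {trainable:,} / {total:,} trainable ({100*trainable/total:.2f}%)")
    
    return model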


print("LoRA implementation complete!")
print("For production use, consider using the peft library from HuggingFace.")

---

## Exercises

### Exercise 1: Multi-label Classification

Modify the classification model to handle multi-label classification (multiple labels per example).
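
As a warm-up for this exercise, the sketch below shows the two ingredients on plain Python numbers: multi-hot target encoding and binary cross-entropy computed on logits, where each label is an independent yes/no decision. The tag set, logits, and helper names are hypothetical.

```python
import math
from typing import List

def multi_hot(labels: List[str], label_set: List[str]) -> List[float]:
    """Encode a set of labels as a 0/1 vector over the full label set."""
    return [1.0 if name in labels else 0.0 for name in label_set]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logits(logits: List[float], targets: List[float]) -> float:
    """Mean binary cross-entropy, treating each label as an independent decision."""
    losses = []
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)

label_set = ['action', 'comedy', 'drama', 'sci-fi']  # hypothetical tag set
targets = multi_hot(['action', 'sci-fi'], label_set)
logits = [2.0, -1.0, -3.0, 0.5]                      # made-up model outputs

loss = bce_with_logits(logits, targets)
predicted = [name for name, z in zip(label_set, logits) if sigmoid(z) > 0.5]

print("Targets:", targets)
print(f"Loss: {loss:.4f}")
print("Predicted labels:", predicted)
```

Unlike softmax, the sigmoid probabilities do not need to sum to one, so any number of labels can be active for the same example.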

In [None]:
# Exercise 1: Multi-label classification

class MultiLabelClassifier(nn.Module):
    """
    BERT-based multi-label classifier.
    
    Each example can have multiple labels (e.g., tags, categories).
    
    Key changes from single-label:
    1. Use BCEWithLogitsLoss instead of CrossEntropyLoss
    2. Use sigmoid instead of softmax for predictions
    3. Labels are multi-hot encoded
    """
    
    def __init__(self, model_name: str, num_labels: int):
        super().__init__()
        # YOUR CODE HERE
        pass
    
    def forward(self, input_ids, attention_mask, labels=None):
        # YOUR CODE HERE
        pass

### Exercise 2: Implement Contrastive Learning for Sentence Embeddings

Create a model that learns good sentence embeddings using contrastive loss.
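
The loss used in this exercise can be previewed on a small similarity matrix. In this sketch, row i holds the similarities between anchor i and every positive in the batch, and the loss is a cross-entropy whose correct class is the diagonal entry. The similarity values are made up, `info_nce` is a name chosen for this example, and only one direction (anchor to positive) is computed.

```python
import math
from typing import List

def info_nce(similarities: List[List[float]], temperature: float = 0.07) -> float:
    """Mean cross-entropy over rows, where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(similarities):
        scaled = [s / temperature for s in row]
        peak = max(scaled)  # log-sum-exp with the max subtracted for stability
        log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
        total += log_sum_exp - scaled[i]  # negative log-probability of the true pair
    return total / len(similarities)

# Made-up cosine similarities between 3 anchors (rows) and 3 positives (columns)
aligned = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.9]]
shuffled = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]]

print(f"Loss when true pairs are most similar: {info_nce(aligned):.4f}")
print(f"Loss when two pairs are mismatched:    {info_nce(shuffled):.4f}")
```

The other items in the batch act as negatives for free, so larger batches give each anchor more negatives to be distinguished from.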

In [None]:
# Exercise 2: Contrastive sentence embeddings

class ContrastiveSentenceEncoder(nn.Module):
    """
    Learn sentence embeddings using contrastive learning.
    
    Similar sentences should have similar embeddings.
    Uses InfoNCE loss (similar to SimCLR, CLIP).
    """
    
    def __init__(self, model_name: str, embedding_dim: int = 256):
        super().__init__()
        # YOUR CODE HERE
        # Hint: Use mean pooling over tokens for sentence embedding
        pass
    
    def encode(self, input_ids, attention_mask) -> torch.Tensor:
        """Get sentence embeddings"""
        # YOUR CODE HERE
        pass
    
    def forward(self, anchor_ids, anchor_mask, positive_ids, positive_mask):
        """Compute contrastive loss"""
        # YOUR CODE HERE
        pass

### Exercise 3: Build a Simple Chatbot

Create a simple chatbot using GPT-2 with context management.
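
The context-management part can be prototyped without a model. The sketch below drops the oldest turns until the formatted prompt fits a token budget; it counts whitespace-separated words as a stand-in for real tokenizer counts, and `trim_history` is a hypothetical helper.

```python
from typing import Dict, List

def trim_history(history: List[Dict[str, str]], user_message: str,
                 max_tokens: int) -> List[Dict[str, str]]:
    """Drop the oldest turns until the formatted prompt fits within max_tokens."""
    def count_tokens(text: str) -> int:
        # Stand-in for a real tokenizer: whitespace-separated words
        return len(text.split())

    def build_prompt(turns: List[Dict[str, str]]) -> str:
        lines = [f"User: {t['user']}\nAssistant: {t['assistant']}" for t in turns]
        lines.append(f"User: {user_message}\nAssistant:")
        return '\n'.join(lines)

    kept = list(history)  # copy so the caller's list is left untouched
    while kept and count_tokens(build_prompt(kept)) > max_tokens:
        kept.pop(0)  # the oldest turn goes first
    return kept

history = [
    {'user': 'Hi there', 'assistant': 'Hello! How can I help?'},
    {'user': 'What is a tokenizer?', 'assistant': 'It splits text into tokens.'},
]

kept = trim_history(history, 'And what is BPE?', max_tokens=20)
print(f"Kept {len(kept)} of {len(history)} turns")
```

In the real chatbot, the budget also has to leave room for the tokens that will be generated, not only for the prompt.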

In [None]:
# Exercise 3: Simple chatbot

class SimpleChatbot:
    """
    A simple chatbot using GPT-2.
    
    Features:
    - Maintains conversation history
    - Handles context length limits
    - Uses appropriate generation parameters
    """
    
    def __init__(self, model_name: str = 'gpt2', max_context_length: int = 512):
        # YOUR CODE HERE
        pass
    
    def respond(self, user_message: str) -> str:
        """Generate a response to the user message"""
        # YOUR CODE HERE
        # Hint: Format as "User: ... \nAssistant: ..."
        pass
    
    def clear_history(self):
        """Clear conversation history"""
        pass

---

## Solutions

In [None]:
# Solution 1: Multi-label classification

class MultiLabelClassifier(nn.Module):
    def __init__(self, model_name: str, num_labels: int):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
        self.num_labels = num_labels
    
    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        
        # Use [CLS] token
        pooled = outputs.last_hidden_state[:, 0, :]
        logits = self.classifier(pooled)
        
        loss = None
        if labels is not None:
            # BCEWithLogitsLoss for multi-label
            loss_fn = nn.BCEWithLogitsLoss()
            loss = loss_fn(logits, labels.float())
        
        return {'loss': loss, 'logits': logits}
    
    def predict(self, input_ids, attention_mask, threshold: float = 0.5):
        with torch.no_grad():
            outputs = self.forward(input_ids, attention_mask)
            probs = torch.sigmoid(outputs['logits'])
            return (probs > threshold).int()


# Test
multi_label_model = MultiLabelClassifier('distilbert-base-uncased', num_labels=5)
print(f"Multi-label classifier created with {sum(p.numel() for p in multi_label_model.parameters()):,} parameters")

In [None]:
# Solution 2: Contrastive sentence encoder

class ContrastiveSentenceEncoder(nn.Module):
    def __init__(self, model_name: str, embedding_dim: int = 256):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.projection = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size, self.bert.config.hidden_size),
            nn.ReLU(),
            nn.Linear(self.bert.config.hidden_size, embedding_dim)
        )
        self.temperature = 0.07
    
    def mean_pooling(self, token_embeddings, attention_mask):
        """Mean pooling over token embeddings"""
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, dim=1)
        sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)
        return sum_embeddings / sum_mask
    
    def encode(self, input_ids, attention_mask) -> torch.Tensor:
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = self.mean_pooling(outputs.last_hidden_state, attention_mask)
        embeddings = self.projection(pooled)
        return F.normalize(embeddings, p=2, dim=1)
    
    def forward(self, anchor_ids, anchor_mask, positive_ids, positive_mask):
        # Get embeddings
        anchor_emb = self.encode(anchor_ids, anchor_mask)
        positive_emb = self.encode(positive_ids, positive_mask)
        
        # InfoNCE loss
        batch_size = anchor_emb.size(0)
        
        # Similarity matrix
        sim_matrix = torch.matmul(anchor_emb, positive_emb.T) / self.temperature
        
        # Labels: positive pairs are on diagonal
        labels = torch.arange(batch_size, device=anchor_emb.device)
        
        # Cross entropy loss (both directions)
        loss = (F.cross_entropy(sim_matrix, labels) + F.cross_entropy(sim_matrix.T, labels)) / 2
        
        return {'loss': loss, 'anchor_emb': anchor_emb, 'positive_emb': positive_emb}


# Test
contrastive_model = ContrastiveSentenceEncoder('distilbert-base-uncased')
print(f"Contrastive encoder created!")

In [None]:
# Solution 3: Simple chatbot

class SimpleChatbot:
    def __init__(self, model_name: str = 'gpt2', max_context_length: int = 512):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.model.eval()
        
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.max_context_length = max_context_length
        self.history = []
        
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)
    
    def _build_prompt(self, user_message: str) -> str:
        """Build prompt with conversation history"""
        prompt = ""
        for turn in self.history:
            prompt += f"User: {turn['user']}\nAssistant: {turn['assistant']}\n"
        prompt += f"User: {user_message}\nAssistant:"
        return prompt
    
    def _truncate_history(self, user_message: str) -> str:
        """Drop the oldest turns until the prompt fits the context budget"""
        prompt = self._build_prompt(user_message)
        tokens = self.tokenizer.encode(prompt)
        while len(tokens) > self.max_context_length - 50 and self.history:
            self.history.pop(0)
            # Rebuild from the original message instead of re-parsing the prompt string
            prompt = self._build_prompt(user_message)
            tokens = self.tokenizer.encode(prompt)
        return prompt
    
    @torch.no_grad()
    def respond(self, user_message: str) -> str:
        prompt = self._truncate_history(user_message)
        
        inputs = self.tokenizer(prompt, return_tensors='pt').to(self.device)
        
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=100,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=self.tokenizer.eos_token_id,
            eos_token_id=self.tokenizer.encode('\n')[0]  # Stop at newline
        )
        
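        # Decode only the newly generated tokens; slicing the decoded string by
        # len(prompt) is fragile because decoding may not reproduce the prompt exactly
        prompt_length = inputs['input_ids'].shape[1]
        assistant_response = self.tokenizer.decode(
            outputs[0][prompt_length:], skip_special_tokens=True
        ).strip()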
        
        # Clean up response
        if 'User:' in assistant_response:
            assistant_response = assistant_response.split('User:')[0].strip()
        
        # Add to history
        self.history.append({
            'user': user_message,
            'assistant': assistant_response
        })
        
        return assistant_response
    
    def clear_history(self):
        self.history = []


# Test
chatbot = SimpleChatbot()
print("Chatbot: Hello! How can I help you today?")
print(f"User: What is machine learning?")
print(f"Chatbot: {chatbot.respond('What is machine learning?')}")

---

## Summary

### Key Takeaways

1. **Pretrained Models**:
   - BERT (encoder): bidirectional, good for understanding
   - GPT (decoder): causal, good for generation
   - Use the right model for your task

2. **Tokenization**:
   - Subword tokenization (BPE, WordPiece)
   - Handle special tokens ([CLS], [SEP], [PAD])
   - Attention mask indicates real vs. padding tokens

3. **Fine-tuning**:
   - Task-specific heads on top of pretrained models
   - Small learning rates (typically 2e-5 to 5e-5), so pretrained weights are adjusted gently
   - Warmup + linear decay schedule

4. **Efficient Fine-tuning**:
   - Freeze base model, train only head
   - Gradual unfreezing (top layers first)
   - LoRA: low-rank adaptation

5. **HuggingFace Ecosystem**:
   - `transformers`: models and tokenizers
   - `datasets`: data loading
   - `Trainer`: simplified training

### Common Tasks

| Task | Auto Class | Output |
|------|------------|--------|
| Classification | AutoModelForSequenceClassification | Logits per class |
| NER | AutoModelForTokenClassification | Logits per token and label |
| QA | AutoModelForQuestionAnswering | Start and end logits per token |
| Generation | AutoModelForCausalLM | Next-token logits over the vocabulary |