# Lesson 4: Seq2Seq Models for Question Answering

## Overview
In this exercise, you'll build a seq2seq model that answers questions based on provided context. The model uses:
- **Encoder**: Bidirectional LSTM that compresses context into a fixed-size vector (THE BOTTLENECK)
- **Decoder**: Unidirectional LSTM that generates answers token-by-token
- **Teacher Forcing**: During training, the decoder is fed the ground-truth previous token instead of its own prediction

The key insight: every context, whether it is 10 or 70 words long, gets compressed into a vector of the same fixed size (256D). This creates a bottleneck that hurts performance on longer contexts. We'll measure this empirically!
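
To make the fixed-size claim concrete, here is a minimal sketch that pushes a 10-token and a 70-token input through a bidirectional LSTM and compares the resulting context vectors. The vocabulary size, the random token ids, and the choice to give each direction half of the 256 dimensions are assumptions for illustration; your `Encoder` in `models.py` may combine the two directions differently.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed sizes for illustration only
vocab_size, embedding_dim, context_dim = 100, 128, 256

embedding = nn.Embedding(vocab_size, embedding_dim)
# Assumption: each direction gets half the context size, so the
# concatenated forward + backward state is 256D
encoder_lstm = nn.LSTM(embedding_dim, context_dim // 2,
                       batch_first=True, bidirectional=True)

def context_vector(num_tokens):
    """Encode a random sequence of the given length into one vector."""
    tokens = torch.randint(0, vocab_size, (1, num_tokens))
    _, (h_n, _) = encoder_lstm(embedding(tokens))
    # h_n has shape (num_directions, batch, hidden); join both directions
    return torch.cat([h_n[0], h_n[1]], dim=-1)

short_ctx = context_vector(10)
long_ctx = context_vector(70)
print(short_ctx.shape, long_ctx.shape)
```

Both shapes come out identical: the 70-token input gets no more room than the 10-token one.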

## Setup: Import Required Libraries

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import random
from tqdm import tqdm

# TODO: Import from your modules
# from data import SyntheticQAGenerator, build_vocabulary, encode_text, decode_text
# from models import Encoder, Decoder, Seq2Seq, count_parameters

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

## Part A: Data Generation and Vocabulary

### A.1: Generate Synthetic Q&A Data
Create synthetic question-answer pairs with controlled context lengths to clearly demonstrate the bottleneck effect.

In [None]:
# TODO: Generate synthetic Q&A data
# Hint: 
# 1. Create a SyntheticQAGenerator instance
# 2. Call generate_dataset() with counts for short (400), medium (400), and long (400) contexts
# 3. Print the total number of examples and inspect one sample to understand the data structure
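
If you want to see the shape of the data before wiring up your own generator, the sketch below is a hypothetical stand-in for `SyntheticQAGenerator`. The word lists, the single "lives in" fact, and the function names are all invented for illustration; only the `(context, question, answer)` tuple layout and the three length ranges come from this lesson.

```python
import random

def make_example(context_len, rng):
    """Hypothetical sample: one fact hidden among filler words."""
    names = ["alice", "bob", "carol", "dave"]
    places = ["paris", "tokyo", "cairo", "lima"]
    fillers = ["the", "weather", "was", "nice", "and", "quiet", "today"]
    name, place = rng.choice(names), rng.choice(places)
    fact = [name, "lives", "in", place]
    padding = [rng.choice(fillers) for _ in range(context_len - len(fact))]
    insert_at = rng.randint(0, len(padding))
    words = padding[:insert_at] + fact + padding[insert_at:]
    return (" ".join(words), f"where does {name} live", place)

def generate_dataset(n_short, n_medium, n_long, seed=42):
    """Return (context, question, answer) tuples in three length bands."""
    rng = random.Random(seed)
    bands = [(8, 12, n_short), (25, 35, n_medium), (50, 70, n_long)]
    data = []
    for low, high, count in bands:
        data += [make_example(rng.randint(low, high), rng) for _ in range(count)]
    return data

qa_data = generate_dataset(400, 400, 400)
print(len(qa_data))
print(qa_data[0])
```

Controlling the context length per band is what lets the later analysis attribute accuracy differences to length rather than to question difficulty.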

### A.2: Build Vocabulary
Create a vocabulary from all contexts, questions, and answers.

In [None]:
# TODO: Build vocabulary from Q&A data
# Hint:
# 1. Use build_vocabulary() from your data module with the Q&A pairs
# 2. Create a reverse mapping (idx2word) so you can decode token indices back to words
# 3. Check the vocabulary size to understand how many unique words you have
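
One possible shape for the vocabulary step is sketched below. The function name `build_vocabulary_sketch`, the lowercasing, and the special tokens other than `<END>` and `[SEP]` are assumptions; your own `build_vocabulary()` may differ.

```python
from collections import Counter

# Assumed special tokens; <PAD> at index 0 is a common convention
SPECIAL_TOKENS = ["<PAD>", "<START>", "<END>", "<UNK>", "[SEP]"]

def build_vocabulary_sketch(qa_pairs, min_freq=1):
    """Map every word seen in contexts, questions, and answers to an index."""
    counts = Counter()
    for context, question, answer in qa_pairs:
        for text in (context, question, answer):
            counts.update(text.lower().split())
    vocab = {token: idx for idx, token in enumerate(SPECIAL_TOKENS)}
    # Sorting keeps the index assignment deterministic across runs
    for word, freq in sorted(counts.items()):
        if freq >= min_freq and word not in vocab:
            vocab[word] = len(vocab)
    return vocab

toy_pairs = [("alice lives in paris", "where does alice live", "paris")]
vocab = build_vocabulary_sketch(toy_pairs)
idx2word = {idx: word for word, idx in vocab.items()}
print(len(vocab))  # 12: five special tokens plus seven distinct words
```

The reverse mapping `idx2word` is what you will use later to turn generated token indices back into text.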

### A.3: Create PyTorch Dataset
Encode Q&A pairs into token sequences.

In [None]:
class QADataset(Dataset):
    """Convert Q&A pairs to token sequences."""
    def __init__(self, qa_pairs, vocab, encode_fn):
        self.qa_pairs = qa_pairs
        self.vocab = vocab
        self.encode_fn = encode_fn
    
    def __len__(self):
        return len(self.qa_pairs)
    
    def __getitem__(self, idx):
        context, question, answer = self.qa_pairs[idx]
        
        # TODO: Encode the data
        # Hint: 
        # 1. Concatenate context and question with a separator token (e.g., "[SEP]")
        # 2. Use encode_text() to convert both source (context+question) and target (answer) to token indices
        # 3. Return the token sequences as torch tensors
        
        # Placeholders so the cell runs; replace them with the encoded sequences
        source_tokens = torch.zeros(20, dtype=torch.long)
        target_tokens = torch.zeros(5, dtype=torch.long)
        
        return source_tokens, target_tokens

# TODO: Split data and create DataLoaders
# Hint:
# 1. Split qa_data into train (70%), validation (20%), and test (10%) sets
# 2. Create QADataset instances for each split
# 3. Wrap each dataset in a DataLoader with appropriate batch size (try 32)
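
The hints above can be sketched as follows. Note one detail they leave implicit: encoded sequences have different lengths, so the `DataLoader` needs a collate function that pads each batch. The token layout, the helper names (`encode_pair`, `split_data`, `pad_collate`), and appending `<END>` to the target are assumptions; adapt them to your own `encode_text()`.

```python
import random
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Assumed indices; match whatever your vocabulary actually uses
PAD_IDX, UNK_IDX = 0, 3

def encode_pair(context, question, answer, vocab):
    """Join context and question with [SEP], then map words to indices."""
    def to_ids(text):
        ids = [vocab.get(word, UNK_IDX) for word in text.split()]
        return torch.tensor(ids, dtype=torch.long)
    return to_ids(f"{context} [SEP] {question}"), to_ids(f"{answer} <END>")

def split_data(items, train_pct=70, val_pct=20, seed=42):
    """Shuffle a copy, then cut it into train/validation/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

def pad_collate(batch):
    """Pad every (source, target) pair to the longest sequence in the batch."""
    sources, targets = zip(*batch)
    return (pad_sequence(list(sources), batch_first=True, padding_value=PAD_IDX),
            pad_sequence(list(targets), batch_first=True, padding_value=PAD_IDX))

toy_vocab = {"<PAD>": 0, "<START>": 1, "<END>": 2, "<UNK>": 3, "[SEP]": 4,
             "alice": 5, "lives": 6, "in": 7, "paris": 8, "where": 9}
toy_pairs = [(" ".join(["alice lives in paris"] * k), "where", "paris")
             for k in range(1, 11)]
encoded = [encode_pair(c, q, a, toy_vocab) for c, q, a in toy_pairs]

train_items, val_items, test_items = split_data(encoded)
train_loader = DataLoader(train_items, batch_size=4, shuffle=True,
                          collate_fn=pad_collate)
src_batch, tgt_batch = next(iter(train_loader))
print(len(train_items), len(val_items), len(test_items))
print(src_batch.shape, tgt_batch.shape)
```

With your real data, pass `QADataset` instances to `DataLoader` in place of the plain lists used here, and shuffle only the training split.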

## Part B: Model Architecture

### B.1: Initialize Model
Create encoder, decoder, and seq2seq model. Make sure you've implemented `models.py` first!

In [None]:
# TODO: Initialize your seq2seq model
# Hint:
# 1. Set embedding_dim (e.g., 128) and hidden_dim (256 is THE BOTTLENECK)
# 2. Create Encoder, Decoder, and Seq2Seq instances using your vocabulary size
# 3. Move model to device and print parameter count to verify setup
# 4. Verify that the context vector size is 256D (understand why this is a bottleneck)
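
For the parameter-count check in step 3, a helper along these lines is a common pattern. This is a sketch of what `count_parameters` might look like, not necessarily the version in your `models.py`, and the standalone LSTM below stands in for your decoder.

```python
import torch.nn as nn

def count_parameters(model):
    """Count the trainable parameters of a module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

embedding_dim, hidden_dim = 128, 256
# Stand-in for the decoder's recurrent layer (single layer, unidirectional)
decoder_lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

# An LSTM layer has 4 gates, each with input weights, recurrent weights,
# and two bias vectors
expected = 4 * (embedding_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
print(count_parameters(decoder_lstm), expected)
```

Comparing the printed count against a hand calculation like `expected` is a quick way to confirm the layer sizes are what you intended.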

## Part C: Training

### C.1: Setup Optimizer and Loss

In [None]:
# TODO: Setup training components
# Hint:
# 1. Choose a loss function appropriate for classification (CrossEntropyLoss) and set ignore_index so padding tokens don't count toward the loss
# 2. Choose an optimizer (Adam is a good choice) with a reasonable learning rate
# 3. Understand what train_epoch() and evaluate() should do:
#    - train_epoch: Use teacher forcing during training (decoder sees ground truth tokens)
#    - evaluate: No teacher forcing during validation/test (decoder only sees its own predictions)
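
To see what ignoring padding means in practice, the sketch below computes the loss on random logits, then scrambles the logits at padded positions and shows the loss is unchanged. It assumes index 0 is the padding token; the batch shape and vocabulary size are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD_IDX = 0  # assumption: padding token index
vocab_size = 10

criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# Fake decoder output: (batch, target_len, vocab_size)
logits = torch.randn(2, 4, vocab_size)
targets = torch.tensor([[5, 3, PAD_IDX, PAD_IDX],
                        [7, 2, 9, PAD_IDX]])

def sequence_loss(logits, targets):
    """Flatten batch and time so each token is one classification example."""
    return criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))

loss = sequence_loss(logits, targets)

# Overwrite the logits at the padded positions of the first sequence
altered = logits.clone()
altered[0, 2:] = torch.randn(2, vocab_size)
altered_loss = sequence_loss(altered, targets)
print(loss.item(), altered_loss.item())
```

Without `ignore_index`, the model would be rewarded for predicting padding, which inflates apparent progress on short answers.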

### C.2: Train Model

In [None]:
# TODO: Train the model
# Hint:
# 1. Run for ~30 epochs and track both training and validation losses
# 2. Decay the teacher forcing ratio over time (start high, decrease gradually)
# 3. Use train_epoch() for training and evaluate() for validation after each epoch
# 4. Print metrics periodically to monitor progress (every 5 epochs)
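
One simple way to decay the teacher forcing ratio is a linear schedule, sketched below. The start and end values (1.0 and 0.5) and both function names are assumptions; other schedules, such as exponential decay, are equally valid.

```python
import random

def teacher_forcing_ratio(epoch, num_epochs=30, start=1.0, end=0.5):
    """Linearly decay the ratio from start (epoch 0) to end (last epoch)."""
    if num_epochs <= 1:
        return end
    progress = min(epoch, num_epochs - 1) / (num_epochs - 1)
    return start + (end - start) * progress

def next_decoder_input(ground_truth_token, predicted_token, ratio, rng=random):
    """With probability `ratio`, feed the ground truth; otherwise the prediction."""
    return ground_truth_token if rng.random() < ratio else predicted_token

for epoch in (0, 10, 20, 29):
    print(epoch, round(teacher_forcing_ratio(epoch), 3))
```

Inside `train_epoch()`, you would call something like `next_decoder_input` once per decoding step, so that later epochs expose the decoder to more of its own mistakes.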

### C.3: Plot Training Curves

In [None]:
# TODO: Visualize training progress
# Hint:
# 1. Plot training and validation losses on the same graph
# 2. Use proper labels, titles, and gridlines for clarity
# 3. Observe: Does training loss decrease? Does validation loss converge or diverge?
# 4. Look for signs of overfitting (training goes down but validation goes up)

## Part D: Inference and Bottleneck Analysis

### D.1: Inference Function
Generate answers for new context-question pairs.

In [None]:
# TODO: Implement inference (answer generation)
# Hint:
# 1. Encode the context and question together (using your encode_text function)
# 2. Pass the encoded input through the encoder to get the context vector (the bottleneck!)
# 3. Use the context vector to seed the decoder
# 4. Autoregressively generate tokens one at a time:
#    - Take the most recently generated token (plus the decoder's hidden state) as input
#    - Decoder outputs logits for the next token
#    - Select the token with highest probability (greedy decoding)
#    - Stop when you generate an <END> token or reach max_length
# 5. Decode the token indices back to text

# Test on some examples from the test set and compare predictions to ground truth
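
The generation loop in step 4 can be sketched independently of any particular model. Below, `step_fn` is a hypothetical callable standing in for one decoder step (previous token and state in, next-token logits and new state out), and the `<START>`/`<END>` indices are assumed. A scripted stub replaces the trained decoder so the control flow is easy to follow.

```python
import torch

START_IDX, END_IDX = 1, 2  # assumed special-token indices

def greedy_decode(step_fn, context_vector, max_length=10):
    """Generate token ids one at a time, always taking the argmax."""
    generated = []
    prev_token, state = START_IDX, context_vector
    for _ in range(max_length):
        logits, state = step_fn(prev_token, state)
        prev_token = int(torch.argmax(logits))
        if prev_token == END_IDX:
            break  # stop without adding <END> to the output
        generated.append(prev_token)
    return generated

def scripted_step(prev_token, state):
    """Stub decoder step: emits 5, then 6, then <END>."""
    script = {START_IDX: 5, 5: 6, 6: END_IDX}
    logits = torch.zeros(8)
    logits[script[prev_token]] = 1.0
    return logits, state

print(greedy_decode(scripted_step, context_vector=torch.zeros(256)))  # [5, 6]
```

With a real model, wrap the call in `torch.no_grad()` and switch the model to `eval()` mode first, then map the returned indices through `idx2word`.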

### D.2: Calculate Accuracy

In [None]:
# TODO: Calculate accuracy on test set
# Hint:
# 1. Iterate through the test set
# 2. Generate predictions using your inference function
# 3. Compare each prediction to the ground truth answer (exact match, optionally ignoring case)
# 4. Calculate the percentage of correct predictions
# 5. Print the overall test accuracy
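
A minimal sketch of the comparison step, assuming predictions and ground-truth answers are already decoded to strings. The function name and the whitespace normalization are illustrative choices, and the toy strings are made up.

```python
def exact_match_accuracy(predictions, references, case_sensitive=False):
    """Percentage of predictions that equal their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0

    def normalize(text):
        text = " ".join(text.split())  # collapse stray whitespace
        return text if case_sensitive else text.lower()

    correct = sum(normalize(p) == normalize(r)
                  for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

toy_predictions = ["Paris", "tokyo", "lima"]
toy_references = ["paris", "tokyo", "cairo"]
print(f"{exact_match_accuracy(toy_predictions, toy_references):.1f}%")
```

Exact match is a strict metric: a prediction with one extra or missing word counts as wrong, which is reasonable for the short answers in this dataset.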

### D.3: Bottleneck Analysis

**THE KEY INSIGHT**: The encoder compresses every context into a vector of the same fixed size (256D), regardless of context length. This should hurt performance significantly on longer contexts.

Let's measure this empirically!

In [None]:
# TODO: THE KEY EXERCISE - Analyze accuracy by context length
# Hint:
# 1. For each test example, collect: context_length (word count), prediction, ground_truth, is_correct
# 2. Bucket results into three categories:
#    - Short contexts: 8-12 words
#    - Medium contexts: 25-35 words
#    - Long contexts: 50-70 words
# 3. Calculate accuracy within each bucket
# 4. Plot a bar chart showing the three accuracies
# 5. OBSERVE: You should see accuracy drop by roughly 30-40 percentage points from short to long
#    - Why? Because every context is compressed into a vector of THE SAME 256D size!
# 6. This empirically demonstrates why attention mechanisms are needed (next lesson)
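
The bucketing in steps 1-3 can be sketched in plain Python. The `(context_length, is_correct)` record format and the function name are assumptions, and the toy records below are made-up values that only exercise the function; they are not measured results.

```python
# Length ranges from this lesson, inclusive on both ends
BUCKETS = {"short": (8, 12), "medium": (25, 35), "long": (50, 70)}

def accuracy_by_bucket(results):
    """results: iterable of (context_length, is_correct) pairs."""
    totals = {name: [0, 0] for name in BUCKETS}  # [correct, seen]
    for length, is_correct in results:
        for name, (low, high) in BUCKETS.items():
            if low <= length <= high:
                totals[name][0] += int(is_correct)
                totals[name][1] += 1
                break
    # None marks a bucket that received no examples
    return {name: (100.0 * correct / seen if seen else None)
            for name, (correct, seen) in totals.items()}

toy_results = [(10, True), (9, True), (30, True), (28, False),
               (60, False), (55, False), (65, True), (52, False)]
print(accuracy_by_bucket(toy_results))
# {'short': 100.0, 'medium': 50.0, 'long': 25.0}
```

The returned dictionary can be passed straight to a bar chart, for example `plt.bar(list(acc.keys()), list(acc.values()))` once you have filtered out any empty buckets.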

## Summary

### What We Learned

1. **The Bottleneck Problem**: The encoder compresses every context into a fixed 256D vector. Whether the context is 10 words or 70 words, the decoder receives a representation with the same capacity. This creates a fundamental limitation.

2. **Empirical Evidence**: You should see accuracy fall by roughly 30-40 percentage points from short to long contexts:
   - Short contexts (8-12 words): ~88-95% accuracy
   - Medium contexts (25-35 words): ~75-80% accuracy
   - Long contexts (50-70 words): ~50-60% accuracy

3. **Why Attention Matters**: The next lesson covers attention mechanisms, which allow the decoder to "look back" at the entire context during generation. Instead of compressing the context into one vector, attention lets the model focus on the relevant parts dynamically. This relieves the bottleneck!

### Key Takeaways
- **Teacher Forcing**: Provides stable training by using ground truth tokens as input
- **Bidirectional Encoder**: Captures context from both directions
- **Context Vector**: The fixed-size bottleneck that limits model performance
- **Attention (Next)**: Dynamic context focusing to overcome the bottleneck

In Lesson 5, we'll add attention to eliminate this bottleneck and see performance improve significantly!