# IDS 576: Assignment 3 (10 points)

## Learning Objectives

By completing this assignment, you will:

- **Understand language modeling fundamentals**: Compare classical n-gram approaches with neural language models and analyze their trade-offs in terms of computational cost, memory requirements, and generation quality.

- **Implement and train RNN-based models**: Gain hands-on experience building LSTM-based language models, understanding how to adapt sequence-to-sequence architectures for different NLP tasks.

- **Build sequence-to-sequence translation systems**: Implement encoder-decoder architectures for machine translation, incorporating pre-trained word embeddings (GloVe, FastText) to improve model performance.

- **Evaluate NLP models using standard metrics**: Apply perplexity for language model evaluation and BLEU score for translation quality assessment, understanding what these metrics capture about model performance.

- **Visualize and interpret attention mechanisms**: Analyze how attention weights distribute across input sequences and interpret what the model learns to focus on during translation.

---

## Submission Guidelines

- Submit your solutions as a single notebook (`.ipynb`) and as a PDF on the submission site. You do not need to submit datasets or Word documents.

- **Notebook structure**: Organize your notebook with clear section headers matching the question numbers. Include all code cells, output cells, and markdown explanations.

- **Naming convention**: Name your submission file as `Assignment3_<YourNetID>.ipynb` and `Assignment3_<YourNetID>.pdf`.

- Answer the following questions concisely, in complete sentences and with full clarity. If in doubt, ask classmates and the teaching staff. Collaboration across groups is not allowed. Always cite all your sources.

- **Code requirements**: Include all necessary imports at the top of your notebook. Ensure your code runs end-to-end without errors. Set random seeds where applicable for reproducibility.

---

### Question 1: RNN for Language Modeling (4 pt)

Build and compare language models using the IMDB dataset.

**Sub-questions:**

- **(a) Data Preparation and N-gram Model (1 pt)**: Import the torchtext IMDB dataset and build a Markov (n-gram) language model.
  - Implement at least a bigram and trigram model
  - Handle unknown words appropriately using smoothing techniques
  - Document your vocabulary size and any preprocessing steps (lowercasing, tokenization, etc.)

- **(b) LSTM Language Model (1.5 pt)**: Adapt the architecture from [Seq2Seq_LSTM_Simple_Sentiment_Analysis.ipynb](examples/M05_recurrent/Seq2Seq_LSTM_Simple_Sentiment_Analysis.ipynb) to build an LSTM-based language model.
  - Modify the model output layer for next-word prediction
  - Train for at least 5 epochs (or until convergence)
  - Plot training loss as a function of epochs/iterations

- **(c) Design Choices Discussion (0.75 pt)**: For each model, describe the key design choices made. Briefly mention how each choice influences training time and generative quality.
  - Discuss: n-gram order, embedding dimension, hidden size, number of layers, dropout rate, learning rate

- **(d) Text Generation (0.75 pt)**: For each model, start with the phrase "My favorite movie " and sample one word at a time until you have a generated review of approximately 20 words. Repeat this 5 times (you should ideally get different outputs each time) and report the outputs.
  - Use temperature-based sampling for the LSTM model
  - Compare the coherence and diversity of outputs between models
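
The temperature-based sampling mentioned in (d) can be sketched without any framework. This is a minimal illustration, not a required implementation: the function names `softmax_with_temperature` and `sample_index` and the toy logits are assumptions made for the example. In an LSTM model the logits would come from the output layer at the last timestep.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature=1.0, rng=random):
    """Draw one vocabulary index according to the tempered distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy scores for a 3-word vocabulary (made-up values)
logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, temperature=0.5)  # sharper distribution
hot = softmax_with_temperature(logits, temperature=1.5)   # flatter distribution

rng = random.Random(0)  # seeded generator for reproducible draws
idx = sample_index(logits, temperature=0.8, rng=rng)
```

Lower temperatures concentrate probability on the highest-scoring word, so repeated generations look more alike; higher temperatures spread probability out and increase diversity at the cost of coherence.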

**Deliverables:**
- [ ] Code cell: Data loading and preprocessing pipeline
- [ ] Code cell: N-gram language model implementation with smoothing
- [ ] Code cell: LSTM language model architecture definition
- [ ] Figure: Training loss curve for the LSTM model (properly labeled axes)
- [ ] Table: Summary of design choices for both models
- [ ] Output: 5 generated reviews from each model (10 total), clearly labeled
- [ ] Markdown cell: Discussion of design choices and their impact

**Grading Criteria:**
- Full credit: Both models implemented correctly, training curve shows convergence, generated text is sensible, and design choices are thoroughly discussed
- Partial credit: Models work but with minor issues, incomplete discussion, or limited generation quality
- Minimal credit: Only one model implemented, or significant errors in implementation

Note: Make any reasonable assumptions as necessary and document them clearly.

---

### Question 2: Sequence to Sequence Model for Translation (4 pt)

Build translation models between English and a language of your choice.

**Sub-questions:**

- **(a) Model 1: X → English (1.5 pt)**: Train a sequence-to-sequence model that translates from a language of your choice (X) into English. Build on the [example notebook](examples/M05_recurrent/Seq2Seq_Translation_Example.ipynb) (also available [here](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)) and choose any language pair other than French-English from [this collection](https://www.manythings.org/anki/).
  - Document which language pair you selected and why
  - Report the dataset size (number of sentence pairs)
  - Train until validation loss plateaus or for at least 20,000 iterations

- **(b) Model 2: English → X with Pre-trained Embeddings (1.5 pt)**: Train another model for the reverse direction (English to your chosen language). Use GloVe 100-dimensional embeddings for the English encoder. Use [FastText](https://fasttext.cc/docs/en/crawl-vectors.html) embeddings for the target language if available.
  - Compare training dynamics with and without pre-trained embeddings (at least qualitatively)
  - Document embedding coverage (what percentage of vocabulary has pre-trained vectors)

- **(c) Round-Trip Translation (1 pt)**: Input 5 well-formed sentences in English to Model 2, then input the resulting translated sentences to Model 1. Display all model outputs in each case.
  - Select sentences of varying complexity (simple, compound, with idioms, etc.)
  - Analyze where the round-trip translation succeeds and fails

**Deliverables:**
- [ ] Code cell: Data loading for chosen language pair
- [ ] Code cell: Model 1 (X → English) architecture and training
- [ ] Code cell: Model 2 (English → X) architecture with pre-trained embeddings
- [ ] Figure: Training loss curves for both models
- [ ] Table: 5 English sentences, their translations (Model 2 output), and back-translations (Model 1 output)
- [ ] Markdown cell: Analysis of translation quality and failure modes

**Grading Criteria:**
- Full credit: Both models trained successfully, pre-trained embeddings properly integrated, round-trip analysis is insightful
- Partial credit: Models work but embeddings not properly loaded, or superficial analysis
- Minimal credit: Only one direction implemented, or significant training issues

Note: Make any reasonable assumptions as necessary and document them clearly.

---

### Question 3: Model Evaluation with Perplexity and BLEU Score (1.5 pt)

Apply standard NLP evaluation metrics to your models from Questions 1 and 2.

**Sub-questions:**

- **(a) Perplexity for Language Models (0.75 pt)**: Compute and compare the perplexity of your n-gram and LSTM language models on a held-out test set from the IMDB dataset.
  - Use at least 1000 sentences for evaluation
  - Report perplexity for bigram, trigram, and LSTM models
  - Explain what lower/higher perplexity indicates about model quality

- **(b) BLEU Score for Translation (0.75 pt)**: Compute BLEU scores for your translation models using a held-out test set.
  - Report BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores
  - Compare scores between Model 1 and Model 2
  - Discuss what the BLEU scores reveal about translation quality

**Deliverables:**
- [ ] Code cell: Perplexity computation for all language models
- [ ] Code cell: BLEU score computation for translation models
- [ ] Table: Perplexity comparison (bigram, trigram, LSTM)
- [ ] Table: BLEU scores (BLEU-1 through BLEU-4) for both translation directions
- [ ] Markdown cell: Interpretation of metrics and what they reveal about model quality

**Grading Criteria:**
- Full credit: Correct implementation of both metrics, proper test set usage, insightful interpretation
- Partial credit: Metrics computed but with minor errors, or shallow interpretation
- Minimal credit: Only one metric computed, or incorrect implementation

---

### Question 4: Attention Visualization (0.5 pt)

Visualize and interpret attention patterns in your sequence-to-sequence translation model.

**Requirements:**

- Modify Model 1 or Model 2 from Question 2 to include an attention mechanism (if not already present in your implementation)

- For 3 example sentences of varying lengths:
  - Generate a heatmap showing attention weights between source and target tokens
  - Rows should represent target (output) tokens, columns should represent source (input) tokens
  - Use a clear colormap (e.g., viridis or hot) with a colorbar

- Analyze the attention patterns:
  - Do attention weights align with expected word correspondences?
  - How does the model handle word reordering between languages?
  - Are there any surprising or incorrect attention patterns?

**Deliverables:**
- [ ] Code cell: Attention mechanism implementation (if modifying base model)
- [ ] Code cell: Attention weight extraction and visualization function
- [ ] Figure: 3 attention heatmaps with proper labels (source tokens on x-axis, target tokens on y-axis)
- [ ] Markdown cell: Analysis of attention patterns and what they reveal about the translation process

**Grading Criteria:**
- Full credit: Clear, properly labeled heatmaps with insightful analysis of attention patterns
- Partial credit: Heatmaps generated but with labeling issues or shallow analysis
- Minimal credit: Incorrect visualization or no analysis

---

## Hints

### Question 1 Hints

- **Data loading**: Use `torchtext.datasets.IMDB` or the HuggingFace datasets library (`datasets.load_dataset('imdb')`)
- **N-gram smoothing**: Implement Laplace (add-1) smoothing or Kneser-Ney smoothing to handle unseen n-grams
- **LSTM for language modeling**: Change the output dimension to vocabulary size and use `CrossEntropyLoss`. The target at each timestep is the next word.
- **Temperature sampling**: Divide logits by temperature before softmax; lower temperature (0.5-0.8) = more conservative, higher (1.2-1.5) = more diverse
- **Perplexity formula**: $PPL = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i|w_1,...,w_{i-1})\right)$

```python
# Useful imports for Question 1
from collections import Counter, defaultdict
import torch.nn as nn
import torch.nn.functional as F
```
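
The smoothing hint above can be illustrated with a small bigram model that uses Laplace (add-1) smoothing. This is a sketch under stated assumptions: the helper names `train_bigram` and `bigram_prob`, the `<s>`, `</s>`, and `<unk>` marker tokens, and the two-sentence toy corpus are all illustrative choices, not part of the assignment.

```python
from collections import Counter

def train_bigram(sentences, min_count=1):
    """Count contexts and bigrams over tokenized sentences; rare words map to <unk>."""
    word_counts = Counter(w for s in sentences for w in s)
    vocab = {w for w, c in word_counts.items() if c >= min_count}
    vocab |= {"<unk>", "<s>", "</s>"}
    context_counts, bigram_counts = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + [w if w in vocab else "<unk>" for w in s] + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return vocab, context_counts, bigram_counts

def bigram_prob(prev, cur, vocab, context_counts, bigram_counts):
    """P(cur | prev) with add-1 smoothing, so unseen bigrams keep non-zero mass."""
    prev = prev if prev in vocab else "<unk>"
    cur = cur if cur in vocab else "<unk>"
    return (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + len(vocab))

corpus = [["my", "favorite", "movie", "is", "great"],
          ["my", "favorite", "actor", "is", "great"]]
vocab, ctx, bi = train_bigram(corpus)

p_seen = bigram_prob("my", "favorite", vocab, ctx, bi)  # bigram observed twice
p_unseen = bigram_prob("my", "great", vocab, ctx, bi)   # bigram never observed
```

Because every count is incremented by one and the denominator grows by the vocabulary size, the probabilities for a given context still sum to 1. A trigram model follows the same pattern with a two-word context as the key.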

### Question 2 Hints

- **Language pair selection**: Spanish, German, and Italian have good data availability. Avoid French (excluded) and very low-resource languages.
- **Loading GloVe embeddings**: Use `torchtext.vocab.GloVe` or download from [Stanford NLP](https://nlp.stanford.edu/projects/glove/)
- **Embedding initialization**: For words not in GloVe/FastText, initialize with random vectors or the mean of all embeddings
- **Freezing vs. fine-tuning embeddings**: Try both `embedding.weight.requires_grad = False` (frozen) and `True` (fine-tuned)

```python
# Loading GloVe embeddings
from torchtext.vocab import GloVe
glove = GloVe(name='6B', dim=100)

# Loading FastText
import fasttext.util
fasttext.util.download_model('de', if_exists='ignore')  # German example
ft = fasttext.load_model('cc.de.300.bin')
```
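
Once the vectors are loaded, they still have to be copied into the model's embedding layer, with some fallback for words that have no pre-trained vector. The sketch below shows one way to do this and to measure coverage. The helper `build_embedding_layer`, the 0.1 scale for random rows, and the toy 3-dimensional vectors are assumptions for illustration; in practice `pretrained` would be filled from the GloVe or FastText vectors loaded above.

```python
import torch
import torch.nn as nn

def build_embedding_layer(vocab, pretrained, dim, freeze=False, seed=42):
    """vocab: list of words (row index = position); pretrained: dict word -> vector."""
    generator = torch.Generator().manual_seed(seed)
    # Start from small random rows; 0.1 is an arbitrary scale for OOV words
    weights = torch.randn(len(vocab), dim, generator=generator) * 0.1
    covered = 0
    for idx, word in enumerate(vocab):
        vector = pretrained.get(word)
        if vector is not None:
            weights[idx] = torch.as_tensor(vector, dtype=torch.float)
            covered += 1
    coverage = covered / len(vocab)
    # freeze=True keeps the vectors fixed; freeze=False fine-tunes them during training
    layer = nn.Embedding.from_pretrained(weights, freeze=freeze)
    return layer, coverage

# Toy stand-in for real pre-trained vectors (made-up values)
toy_vectors = {"the": [0.1, 0.2, 0.3], "movie": [0.4, 0.5, 0.6]}
toy_vocab = ["<pad>", "the", "movie", "zzyzx"]

embedding, coverage = build_embedding_layer(toy_vocab, toy_vectors, dim=3, freeze=True)
print(f"Embedding coverage: {coverage:.0%}")
```

Calling the helper once with `freeze=True` and once with `freeze=False` gives the two settings suggested in the hints, and the returned `coverage` is the percentage to report in part (b).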

### Question 3 Hints

- **Perplexity computation**: Use `torch.exp(loss)`, where `loss` is the cross-entropy averaged per token over the test set (natural log, as returned by `CrossEntropyLoss`)
- **BLEU score**: Use `nltk.translate.bleu_score` or `sacrebleu` library for standardized computation
- **Test set size**: Question 3(a) requires at least 1000 sentences for perplexity; for BLEU, aim for 500-1000 held-out sentence pairs so the estimate is reliable

```python
# BLEU score computation
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction

# For single sentence
reference = [['this', 'is', 'a', 'test']]
candidate = ['this', 'is', 'test']
score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
```
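
For the BLEU-1 through BLEU-4 table, one common approach is to call `corpus_bleu` with different `weights` tuples, where BLEU-n puts uniform weight on n-gram orders 1 to n. The sketch below assumes that cumulative definition; the two toy sentence pairs are made up for illustration.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of reference translations per hypothesis (here a single reference each)
references = [
    [["the", "cat", "sat", "on", "the", "mat"]],
    [["i", "like", "green", "tea"]],
]
hypotheses = [
    ["the", "cat", "sat", "on", "a", "mat"],
    ["i", "like", "green", "tea"],
]

# Cumulative BLEU-n: equal weight on every n-gram order up to n
weight_sets = {
    "BLEU-1": (1.0,),
    "BLEU-2": (0.5, 0.5),
    "BLEU-3": (1 / 3, 1 / 3, 1 / 3),
    "BLEU-4": (0.25, 0.25, 0.25, 0.25),
}

smooth = SmoothingFunction().method1  # avoids zero scores when an n-gram order has no match
scores = {
    name: corpus_bleu(references, hypotheses, weights=w, smoothing_function=smooth)
    for name, w in weight_sets.items()
}

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

NLTK returns values between 0 and 1, while `sacrebleu` reports on a 0-100 scale, so state which scale your table uses.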

### Question 4 Hints

- **Attention mechanism**: The example notebook already includes attention. Extract weights from `decoder_attention` or similar.
- **Visualization**: Use `matplotlib.pyplot.imshow()` with proper extent and labels
- **Expected patterns**: Diagonal patterns indicate word-by-word correspondence; off-diagonal indicates reordering

```python
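# Attention visualization
import matplotlib.pyplot as plt

def show_attention(input_sentence, output_words, attentions):
    """Plot attention weights: rows are target tokens, columns are source tokens."""
    source_tokens = input_sentence.split(' ') + ['<EOS>']
    fig, ax = plt.subplots()
    # Detach and move to CPU so this also works for GPU tensors that track gradients
    cax = ax.matshow(attentions.detach().cpu().numpy(), cmap='viridis')
    fig.colorbar(cax)
    # Fix tick positions before assigning labels so each label lines up with its row/column
    ax.set_xticks(range(len(source_tokens)))
    ax.set_xticklabels(source_tokens, rotation=90)
    ax.set_yticks(range(len(output_words)))
    ax.set_yticklabels(output_words)
    ax.set_xlabel('Source tokens')
    ax.set_ylabel('Target tokens')
    plt.show()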
```

---

## FAQ

**Q: Do I need a GPU for this assignment?**
A: A GPU is recommended but not required. On CPU, expect LSTM training to take 1-2 hours for Question 1 and 2-4 hours for Question 2. On a GPU (e.g., Google Colab), training should complete in 15-30 minutes per model. You can reduce training iterations if time-constrained, but document this choice.

**Q: What if my GloVe/FastText embedding coverage is low?**
A: It's acceptable if 70-90% of your vocabulary is covered by pre-trained embeddings. For out-of-vocabulary words, initialize randomly using `torch.randn()` scaled appropriately, or use the mean of all embeddings. Document your coverage percentage and OOV handling strategy.

**Q: How long should my generated reviews be for Question 1?**
A: Aim for approximately 20 words as stated, but anywhere between 15-30 words is acceptable. The key is that the generation should demonstrate the model's capability to produce coherent text.

**Q: Which language should I choose for translation?**
A: Choose any language except French that is available in the [Anki dataset](https://www.manythings.org/anki/). Popular choices include Spanish, German, Italian, Portuguese, or Dutch. Consider your familiarity with the language for manual evaluation of translations.

**Q: What if my BLEU scores are very low?**
A: BLEU scores for neural machine translation on small datasets can be quite low. Scores of 5-15 on the 0-100 scale (0.05-0.15 as reported by NLTK, which uses a 0-1 scale) are common for these models. Focus on explaining why the scores are what they are, and what could be done to improve them. Low scores are acceptable as long as your analysis is thoughtful.

**Q: Can I use attention in my seq2seq model from the start?**
A: Yes, the example notebook already includes attention. You're encouraged to use attention for better performance. Question 4 specifically asks you to visualize these attention weights.

**Q: How should I handle sentences of different lengths?**
A: Use padding and masking. Pack sequences using `torch.nn.utils.rnn.pack_padded_sequence()` for efficiency, or use simple padding with appropriate mask handling in your loss computation.
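
A minimal sketch of padding, packing, and masking the loss for next-word prediction is shown below. The padding index 0, the tiny layer sizes, and the three toy sequences are assumptions chosen to keep the example short.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

PAD_IDX = 0      # assumption: index 0 is reserved for padding
VOCAB_SIZE = 10  # toy vocabulary size
torch.manual_seed(0)

# Three already-numericalized sequences of different lengths
sequences = [torch.tensor([5, 8, 2, 9]), torch.tensor([4, 7]), torch.tensor([6, 3, 1])]
lengths = torch.tensor([len(s) for s in sequences])
padded = pad_sequence(sequences, batch_first=True, padding_value=PAD_IDX)  # shape (3, 4)

embedding = nn.Embedding(VOCAB_SIZE, 8, padding_idx=PAD_IDX)
lstm = nn.LSTM(8, 16, batch_first=True)
to_vocab = nn.Linear(16, VOCAB_SIZE)

# Packing lets the LSTM skip padded positions
packed = pack_padded_sequence(embedding(padded), lengths,
                              batch_first=True, enforce_sorted=False)
packed_out, _ = lstm(packed)
outputs, _ = pad_packed_sequence(packed_out, batch_first=True,
                                 total_length=padded.size(1))
logits = to_vocab(outputs)  # shape (3, 4, VOCAB_SIZE)

# Next-word prediction: the output at position t is scored against the token at t + 1.
# ignore_index masks padded targets so they do not contribute to the loss.
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
loss = loss_fn(logits[:, :-1].reshape(-1, VOCAB_SIZE), padded[:, 1:].reshape(-1))
```

Because padded targets are ignored, the averaged loss reflects only real tokens, which also keeps the perplexity computed from it meaningful.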

**Q: What assumptions can I make about preprocessing?**
A: You can assume:
- Lowercasing all text is acceptable
- Basic tokenization (splitting on whitespace and punctuation) is sufficient
- Limiting vocabulary to top 10,000-50,000 words is acceptable
- Using sentence length limits (e.g., max 50 tokens) for training efficiency is acceptable
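
These assumptions can be combined into a small preprocessing pipeline. The sketch below is one possible version: the regular expression, the `<pad>` and `<unk>` special tokens, and the helper names `tokenize`, `build_vocab`, and `encode` are illustrative choices rather than requirements.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase, then split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(texts, max_size=10000):
    """Keep the most frequent tokens; everything else maps to <unk> at encoding time."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    itos = ["<pad>", "<unk>"] + [w for w, _ in counts.most_common(max_size - 2)]
    return {word: idx for idx, word in enumerate(itos)}

def encode(text, stoi, max_len=50):
    """Convert text to token ids, truncating to max_len tokens."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokenize(text)[:max_len]]

reviews = ["My favorite movie is great!", "The movie was not great."]
stoi = build_vocab(reviews, max_size=8)  # tiny limit to show truncation
ids = encode("My favorite film!", stoi)  # "film" is out of vocabulary
```

Reporting `len(stoi)` together with the chosen `max_size` and `max_len` covers the vocabulary documentation asked for in Question 1(a).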

**Q: My perplexity values seem too high or too low. What's expected?**
A: For a well-trained LSTM language model on IMDB, expect perplexity in the range of 50-200. N-gram models typically have higher perplexity (100-500). If you get perplexity below 10, check for data leakage; if above 1000, check your loss computation.
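
A quick sanity check for the loss computation is to verify the perplexity function on inputs with a known answer. The sketch below assumes natural-log probabilities, one per test token; the helper name `perplexity` is illustrative.

```python
import math

def perplexity(log_probs):
    """exp of the negative mean log-probability over all test tokens."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that assigns probability 1/4 to every token has perplexity 4
uniform = [math.log(1 / 4)] * 10
ppl_uniform = perplexity(uniform)

# A more confident (and correct) model has lower perplexity
confident = [math.log(0.9)] * 10
ppl_confident = perplexity(confident)
```

For the n-gram models, `log_probs` would hold the log of each smoothed conditional probability; for the LSTM, the same quantity is `torch.exp` of the mean per-token cross-entropy. Perplexities are only comparable when the models share the same vocabulary and tokenization.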

---

## Resources

### Course Materials

- [Sequence-to-Sequence and RNNs Slides](slides/M05a_recurrent.pdf) - Core concepts for this assignment
- [Attention Mechanisms Slides](slides/M05b_attention.pdf) - Understanding attention for Question 4
- [NLP Fundamentals Slides](slides/M04_text.pdf) - Word embeddings and language modeling basics
- [Advanced Text Processing Slides](slides/M06a_transformers.pdf) - Additional context on NLP techniques

### Example Notebooks

- [Seq2Seq LSTM Sentiment Analysis](examples/M05_recurrent/Seq2Seq_LSTM_Simple_Sentiment_Analysis.ipynb) - Base for Question 1 LSTM model
- [Seq2Seq Translation Example](examples/M05_recurrent/Seq2Seq_Translation_Example.ipynb) - Base for Question 2
- [Seq2Seq RNN Sentiment Analysis](examples/M05_recurrent/Seq2Seq_RNN_Simple_Sentiment_Analysis.ipynb) - Alternative RNN architecture reference

### External Resources

- [PyTorch Seq2Seq Translation Tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html) - Official PyTorch tutorial
- [Anki Parallel Corpora](https://www.manythings.org/anki/) - Translation dataset source
- [GloVe Embeddings](https://nlp.stanford.edu/projects/glove/) - Pre-trained English word vectors
- [FastText Embeddings](https://fasttext.cc/docs/en/crawl-vectors.html) - Pre-trained multilingual word vectors
- [NLTK BLEU Score Documentation](https://www.nltk.org/api/nltk.translate.bleu_score.html) - BLEU implementation reference
- [SacreBLEU Library](https://github.com/mjpost/sacrebleu) - Standardized BLEU computation

### Dataset Download Instructions

**IMDB Dataset (Question 1):**
```python
# Option 1: Using torchtext
from torchtext.datasets import IMDB
train_iter, test_iter = IMDB(split=('train', 'test'))

# Option 2: Using HuggingFace datasets
from datasets import load_dataset
dataset = load_dataset('imdb')
```

**Translation Dataset (Question 2):**
```python
# Download from https://www.manythings.org/anki/
# Example for German-English:
!wget https://www.manythings.org/anki/deu-eng.zip
!unzip deu-eng.zip

# Or use the data loading from the example notebook
```
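
After unzipping, the sentence pairs still need to be read into Python. The sketch below assumes the archive contains a tab-separated text file (for German, `deu.txt`) with English in the first column, the other language in the second, and possibly an attribution column after that. Check your file, since this layout is an assumption; `read_pairs` is a hypothetical helper name.

```python
import io

def read_pairs(lines, max_pairs=None):
    """Parse tab-separated lines into (english, other_language) tuples."""
    pairs = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pairs.append((fields[0], fields[1]))  # ignore any extra columns
        if max_pairs is not None and len(pairs) >= max_pairs:
            break
    return pairs

# In-memory sample standing in for the real file
sample = io.StringIO("Go.\tGeh.\tattribution text\nHi.\tHallo!\tattribution text\n\n")
pairs = read_pairs(sample)
print(f"Loaded {len(pairs)} sentence pairs")

# With the real file (path is an assumption):
# with open("deu.txt", encoding="utf-8") as f:
#     pairs = read_pairs(f)
```

The length of `pairs` is the dataset size to report in Question 2(a).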

**GloVe Embeddings:**
```python
# Option 1: Using torchtext
from torchtext.vocab import GloVe
glove = GloVe(name='6B', dim=100)  # Downloads ~800MB

# Option 2: Manual download
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip
```

```python
# Your code starts here
# Recommended: Start with necessary imports

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter, defaultdict
import random

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

# Check for GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
```