# Assignment 2: Text classification and Word2Vec

**Due date**: January 23, 11:59 PM

For this assignment, you will need to submit two files on Gradescope: `hw2_dist.ipynb` and `hw2_dist.pdf`. Submit `hw2_dist.ipynb` under "Assignment 2 - Code" and `hw2_dist.pdf` under "Assignment 2 - Written".

**Total:** 100 points  
**Do not add extra imports unless explicitly allowed.**

**⚠️ IMPORTANT FOR AUTOGRADING:**  
Throughout this notebook, you will see comment annotations like `# Q1.1`, `# Q1.2.1`, `# Q2.1`, etc. **DO NOT remove or modify these annotations.** They are used by the autograder to identify and extract your answers for grading. Removing or changing them may result in your work not being graded correctly.

**Notes**: 
Although not required, we encourage you to choose your own neural network hyperparameters to practice setting up and training MLPs! 
Some suggestions that you can try to tune:

- **Model architecture**
  - `hidden_dim` 
  - number of hidden layers 
  - activation function (e.g., ReLU)
  - `dropout` rate 

- **Training settings**
  - learning rate 
  - optimizer 
  - batch size 
  - number of epochs 
  - weight decay / L2 regularization


If you want to explore other pretrained word embeddings, you can check the full list of open embeddings available through the `gensim` library [here](https://radimrehurek.com/gensim/models/word2vec.html#pretrained-models).

Different pretrained embeddings have different vector dimensions (for example, 50, 100, 200, or 300).  Make sure to read the embedding dimension using `wv.vector_size`, and update any code that depends on this value.

## Setup / Imports

You will be working on this assignment on Google Colab unless you have local GPUs.
Run the cell below.  
If you see errors about missing packages, install them in your environment (or follow the README.md associated with this assignment).

In [None]:
# Clone (download) the course repository from GitHub into this Colab session.
# This gives us access to helper files like classification_util.py and word2vec_util.py that the notebook depends on.
!git clone https://github.com/uchicago-nlp-course/winter2026-assignments.git

# Move into the homework directory so Python can find and import the helper files.
%cd winter2026-assignments/A2

# Alternatively, you can create a folder in Google Drive with hw2_dist.ipynb, open hw2_dist.ipynb from that folder, and make sure that classification_util.py and word2vec_util.py are in the same folder.

In [None]:
#Install required packages
!pip install gensim nltk 

In [None]:
import sys
assert sys.version_info[0] == 3 and sys.version_info[1] >= 8

# Standard library
import random

# Data and ML
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.feature_extraction.text import CountVectorizer

# Deep learning
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

# NLP
from datasets import load_dataset
from gensim.models import Word2Vec
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Visualization
import matplotlib.pyplot as plt
from tqdm import tqdm

# Local utilities
from classification_util import (
    PAD,
    load_sst2_splits,
    preprocess_sst2,
    load_snli_splits,
    preprocess_snli,
    load_with_retries,
    get_embedding_matrix_and_word2idx,
    SentenceIdDataset,
    SNLIPairIdDataset,
    eval_acc,
    train_dan,
)
from word2vec_util import (
    download_analogy_dataset,
    load_analogy_dataset,
    evaluate_analogies_gensim,
    summarize_results,
)

# Download NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

# Set seeds
np.random.seed(0)
random.seed(0)

## Provided functions
We implemented the `train_dan`, `eval_acc`, and `load_with_retries` functions for you. Please read them carefully to understand what they do.

# Problem 1: Text Classification with SST (50 points)

In this part of the assignment, you will study text classification using two common types of text representations: 
- Count-based methods that represent words by how frequently they appear with other words in a corpus. In this part, you will implement a Bag-of-Words (BoW) matrix.
- Dense, low-dimensional vector representations of words that capture semantic and syntactic relationships, such as Word2Vec and GloVe.

## Sentiment Analysis 

Sentiment analysis is a natural language understanding task in which a model determines whether a sentence expresses a **positive** or **negative** opinion. Please refer to [lecture 3](https://uchicago-nlp-course.github.io/winter2026-lectures/lecture-03/#/2/1).

## Load SST-2 dataset

The SST-2 dataset is a widely used benchmark for this task. It consists of 215,154 short sentences taken from movie reviews, each paired with a sentiment label:
- **negative (0)**
- **positive (1)**

For example:

| Sentence | Label |
|---------|-------|
| The movie was thrilling and engaging. | Positive |
| The plot was dull and predictable. | Negative |

In [None]:
# load_sst2_splits is imported from classification_util
train_data, val_data, test_data = load_sst2_splits()
print(f"Train: {len(train_data)}, Val: {len(val_data)}, Test: {len(test_data)}")
X_train, y_train = preprocess_sst2(train_data)
X_val, y_val = preprocess_sst2(val_data)
X_test, y_test = preprocess_sst2(test_data)

## Question 1.1 - BoW + Logistic Regression on SST2 [code] (15 points)

In this part, we implement a Bag-of-Words (BoW) baseline for the Sentiment Analysis task. You will convert text to a BoW feature vector that captures word occurrence counts. We then train a logistic regression classifier to predict the Sentiment labels.

###  Build BoW Matrix
To keep it simple, you will use the `CountVectorizer` class from `scikit-learn` ([CountVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)). `CountVectorizer` automatically converts each example into a vector of **word counts**. The result is a sparse matrix of shape **(num_examples, vocab_size)**.

You will need to apply `CountVectorizer` to the three data splits.
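To make the shape and contents of a BoW matrix concrete, here is a minimal pure-Python sketch on three toy sentences. It is a simplified stand-in, not the `CountVectorizer` API: the helper names `build_vocab` and `to_bow` are hypothetical, tokenization is a plain whitespace split (an assumption that differs from `CountVectorizer`'s default token pattern), and the result is a dense list of lists rather than a sparse matrix.

```python
from collections import Counter

def build_vocab(docs, min_df=1):
    """Map each word to a column index, keeping words found in >= min_df documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document (document frequency, not term frequency).
        doc_freq.update(set(doc.lower().split()))
    kept = sorted(w for w, c in doc_freq.items() if c >= min_df)
    return {w: i for i, w in enumerate(kept)}

def to_bow(doc, vocab):
    """Turn one document into a row of word counts; out-of-vocabulary words are ignored."""
    row = [0] * len(vocab)
    for tok in doc.lower().split():
        if tok in vocab:
            row[vocab[tok]] += 1
    return row

toy_train = ["good good movie", "bad movie", "good plot"]

# The vocabulary is built from the training split only.
vocab = build_vocab(toy_train, min_df=2)
bow = [to_bow(d, vocab) for d in toy_train]

# Other splits reuse the same vocabulary, so unseen words are simply dropped.
unseen = to_bow("good acting", vocab)

print(vocab)   # column index for each kept word
print(bow)     # one row per training sentence
print(unseen)
```

The key design point is that the vocabulary comes from the training data alone, so every split ends up with the same number of columns.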

In [None]:
# Q1.1.bow
print("Vectorizing text...")
# TODO: Write your implementation here.
# Use CountVectorizer to transform text data
# - Fit on training data with min_df=5
# - Transform train, test, and validation sets


# Check shapes (X_train should have 53879 rows)
print(X_train.shape)

### Model Training and Evaluation
In this step, we train a **logistic regression** classifier on the Bag-of-Words representations to perform Sentiment Analysis. Logistic regression is a strong linear baseline for text classification and allows us to evaluate how well simple word-count features capture sentiment relationships. 

Using the trained model, we evaluate performance on the test split and report **overall accuracy**.

**Note**: Feel free to use the validation dataset for hyperparameter tuning!!
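As a rough illustration of what a trained logistic regression does with a BoW row, the sketch below computes a positive-class probability by hand. The weights, bias, and three-word vocabulary are made up for this example (they are not learned from SST-2), and `predict_proba` here is a hypothetical helper, not the scikit-learn method of the same name.

```python
import math

def predict_proba(counts, weights, bias):
    """Sigmoid of a weighted sum of word counts: P(label = positive)."""
    z = bias + sum(w * x for w, x in zip(weights, counts))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for the toy vocabulary ["good", "bad", "movie"].
weights = [1.5, -2.0, 0.0]
bias = 0.0

p_pos = predict_proba([2, 0, 1], weights, bias)  # "good good movie" -> z = 3.0
p_neg = predict_proba([0, 1, 1], weights, bias)  # "bad movie"       -> z = -2.0

print(round(p_pos, 3), round(p_neg, 3))
```

Because the score is a linear function of the counts, each word contributes independently of its position or neighbors, which is worth keeping in mind when interpreting the baseline's accuracy.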

In [None]:
print("Training Logistic Regression...")


for c_value in [0.01, 0.1, 1.0, 10.0]:
    # Train on training data
    model = LogisticRegression(C=c_value, max_iter=20)
    # TODO: Write your implementation here.
    # Train a logistic regression model
    # Use validation set for hyperparameter tuning (try different C values)
    
    
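# Report best model's test accuracy
# NOTE: after the loop, `model` is the last model trained (C=10.0), which is not
# necessarily the best one. Make sure `best_c` is defined and `model` is the model
# trained with `best_c` before predicting on the test set.
print(f"Winner! Best C is {best_c}")
y_pred = model.predict(X_test)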
accuracy = accuracy_score(y_test, y_pred)

# Evaluation
# Q1.1
print(f"Accuracy: {accuracy:.4f}")

## Question 1.2 - DAN model with Open Pretrained Vectors on SST2 [code] (30 points)

In this part, you will perform text classification using pretrained ("open") word embeddings and a small neural network classifier.

Training high-quality word embeddings requires very large datasets and significant compute. For convenience and reproducibility, we will instead use open, pretrained embeddings (e.g., GloVe trained on Wikipedia and Gigaword) through the `gensim` library. These embeddings already capture useful semantic information and allow us to focus on how embeddings are used in downstream models. 

### Model Structure
We follow the model structure from the paper [Deep Unordered Composition Rivals Syntactic Methods for Text Classification](https://aclanthology.org/P15-1162/) (Iyyer et al., 2015).

The averaged embedding vector is passed through a **multi-layer perceptron (MLP)**, followed by a **2-way softmax** layer that predicts one of the two sentiment labels: **Positive** or **Negative**.

**Note**:

This model architecture has already been introduced in lecture 3. Please refer to the [lecture slides](https://uchicago-nlp-course.github.io/winter2026-lectures/lecture-03/) for a detailed explanation of the model and its motivation.
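The averaging step can be sketched in plain Python to show why padding positions must be excluded. This is an illustration with a made-up 2-dimensional embedding table and hypothetical helper names (`masked_average`, `naive_average`); it is not the PyTorch implementation you will write, and it assumes padding ID 0.

```python
PAD_ID = 0  # assumed padding ID for this sketch

def masked_average(ids, table):
    """Average the embeddings of the real (non-PAD) tokens only."""
    dim = len(table[0])
    real_ids = [i for i in ids if i != PAD_ID]
    if not real_ids:
        # Guard against dividing by zero for an all-PAD sequence.
        return [0.0] * dim
    sums = [0.0] * dim
    for i in real_ids:
        for d in range(dim):
            sums[d] += table[i][d]
    return [s / len(real_ids) for s in sums]

def naive_average(ids, table):
    """Average over every position, including PAD (shown for contrast)."""
    dim = len(table[0])
    return [sum(table[i][d] for i in ids) / len(ids) for d in range(dim)]

# Toy embedding table: row 0 is PAD, rows 1 and 2 are real words.
toy_table = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0]]
padded_ids = [1, 2, 0, 0]  # two real tokens followed by two PAD positions

masked = masked_average(padded_ids, toy_table)
naive = naive_average(padded_ids, toy_table)
print(masked, naive)
```

Even when the PAD row is all zeros, the naive average divides by the padded length, so the same sentence would get a different vector depending on how long the other sentences in its batch are.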

### Prepare data 
For this homework, we will use `gensim.downloader` to load the `glove-wiki-gigaword-100` embedding, a pretrained GloVe word embedding model trained on a large corpus combining Wikipedia and Gigaword news text. It represents each word as a 100-dimensional ($\text{DIM}$=100) vector learned from global word co-occurrence statistics, so that words with similar meanings have similar vectors.

In [None]:
# load_with_retries is imported from classification_util
wv = load_with_retries("glove-wiki-gigaword-100")
DIM = wv.vector_size
print("Loaded vectors:", DIM, "dim | vocab:", len(wv))
embedding_matrix, word2idx = get_embedding_matrix_and_word2idx(wv)
print("Embedding matrix shape:", embedding_matrix.shape)

### Question 1.2.1: Dynamic padding with a collate function (10 points)

You will now implement a collate function to dynamically pad variable-length sentences so they can be batched together.
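The idea of per-batch padding can be shown with plain Python lists. This is only a conceptual sketch: `pad_batch` is a hypothetical helper operating on lists, whereas your collate function should work on tensors, and the padding ID of 0 is assumed.

```python
PAD_ID = 0  # assumed padding ID for this sketch

def pad_batch(seqs, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]

batch_a = pad_batch([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
batch_b = pad_batch([[1, 2], [3]])

print(batch_a)  # padded to length 4
print(batch_b)  # padded to length 2, not to a global maximum
```

Each batch is padded only as far as its own longest sequence, so batches of short sentences stay small.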


In [None]:
#Define PAD token as 0
PAD = 0

def collate_pad(batch):
    """
    Collate function for single-sentence datasets with dynamic padding.

    This collate function is responsible for dynamically padding sentences within
    each batch so they can be processed together by the model. Each dataset example
    returns a variable-length sequence of word IDs, since sentences naturally have
    different lengths. When forming a batch, collate_pad takes all sentence ID
    sequences in the batch and pads the shorter ones with the PAD token so that
    every sentence has the same length as the longest sentence in that batch.
    Padding is applied only at the batch level, which is more efficient than
    padding all sentences to a global maximum length. The function also stacks the
    labels into a single tensor. As a result, the model receives a rectangular
    tensor of shape (batch_size, max_sentence_length_in_batch) along with the
    corresponding labels, while ensuring that padding tokens do not represent
    real words.

    For more information, see:
    https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_sequence.html

    Input:
        batch: list of (ids, label) tuples
    Output:
        ids_padded: LongTensor of shape (B, T_max)
        labels: LongTensor of shape (B,)
    """
    # TODO: Implement the collate function.
    # 1. Separate the batch into a list of token ID tensors and a list of labels
    # 2. Pad the token ID sequences so they all have the same length
    #    - Use pad_sequence
    # 3. Stack labels into a single tensor
    # 4. Return the padded token IDs and labels

    return ids_padded, labels

### Testing the collate_pad function

Before using `collate_pad` in the data loaders, let's verify that it works correctly on simple examples. This function pads sentences to the same length within a batch.

In [None]:
# Q1.2.1
def test_collate_pad_sst():
    """
    Test collate_pad with simple SST-style examples (single sentence).

    Expected behavior:
    - Sequences should be padded to max length in the batch
    - Labels should be stacked into a tensor
    - PAD token (0) should be used for padding
    """
    # Create simple test batch with variable-length sequences
    # Format: (ids, label)
    batch = [
        (torch.tensor([1, 2, 3]), torch.tensor(0)),          # len=3
        (torch.tensor([4, 5]), torch.tensor(1)),             # len=2
        (torch.tensor([6, 7, 8, 9]), torch.tensor(0)),       # len=4  (max)
    ]

    ids_padded, labels = collate_pad(batch)

    # Check shapes
    assert ids_padded.shape == (3, 4), f"Expected ids shape (3, 4), got {ids_padded.shape}"
    assert labels.shape == (3,), f"Expected labels shape (3,), got {labels.shape}"

    # Check padding values (PAD=0)
    # First: [1,2,3] -> [1,2,3,0]
    assert ids_padded[0].tolist() == [1, 2, 3, 0], f"First example incorrect: {ids_padded[0].tolist()}"
    # Second: [4,5] -> [4,5,0,0]
    assert ids_padded[1].tolist() == [4, 5, 0, 0], f"Second example incorrect: {ids_padded[1].tolist()}"
    # Third: [6,7,8,9] -> no padding
    assert ids_padded[2].tolist() == [6, 7, 8, 9], f"Third example incorrect: {ids_padded[2].tolist()}"

    # Check labels
    assert labels.tolist() == [0, 1, 0], f"Labels incorrect: {labels.tolist()}"

    print("All tests passed!")
    print(f"ids_padded shape: {ids_padded.shape}")
    print(f"labels: {labels}")

# Run the test
test_collate_pad_sst()

### Question 1.2.2: Tokenization (5 points)
In this part, you will write a function `tokens_to_ids` that converts tokens into their corresponding word IDs and enforces a maximum sequence length. The helper function below will later be called by `SentenceIdDataset` to map tokenized sentences to model-ready word ID sequences.

In [None]:
# Q1.2.2
def tokens_to_ids(tokens, word2idx, max_len=50):
    """Convert tokens to IDs, truncating to max_len."""
    # TODO: Convert tokens to word IDs.
    # 1. Iterate over the input tokens
    # 2. Truncate the sequence to max_len tokens
    # 3. Return the list of token IDs


    return ids

In [None]:
# Create DataLoaders for SST-2
train_ds = SentenceIdDataset(train_data, word2idx, tokens_to_ids, max_len=50)
val_ds   = SentenceIdDataset(val_data, word2idx, tokens_to_ids, max_len=50)
test_ds  = SentenceIdDataset(test_data, word2idx, tokens_to_ids, max_len=50)

train_loader = DataLoader(train_ds, batch_size=256, shuffle=True,  collate_fn=collate_pad)
val_loader   = DataLoader(val_ds, batch_size=256, shuffle=False, collate_fn=collate_pad)
test_loader  = DataLoader(test_ds, batch_size=256, shuffle=False, collate_fn=collate_pad)

### Question 1.2.3: Training and evaluation (15 points)

In this part, you will need to finish the `SSTDANClassifier` model and train **two versions of the model**:

- **Frozen embeddings** (`freeze=True`):  
  The pretrained word embeddings are kept fixed, and only the classifier parameters are updated.

- **Trainable embeddings** (`freeze=False`):  
  The pretrained word embeddings are fine-tuned jointly with the classifier.

You should report results for **both settings** and briefly compare their performance.

The model is trained using the **AdamW** optimizer and **cross-entropy loss**.

**Training settings**:
- **Learning rate**: `3e-4`
- **Weight decay**: `1e-4`
- **Epochs**: `10`

After training, report the **test accuracy**.

**Note**: The trainable version takes about 6 minutes to run.

In [None]:
class SSTDANClassifier(nn.Module):
    """DAN classifier for single-sentence classification (SST-2)."""
    
    def __init__(self, embedding_matrix, freeze=True, num_classes=2):
        """
        Args:
            embedding_matrix: numpy array of shape (V+1, D) with pretrained embeddings
            freeze: if True, embeddings are not updated during training
            num_classes: number of output classes (2 for SST)
        """
        super().__init__()
        # TODO: Write your implementation here.
        # 1. Create self.embedding using nn.Embedding.from_pretrained
        # 2. Create self.mlp as an nn.Sequential with:
        #    - 3 hidden layers of size 256 with ReLU and Dropout(0.2)
        #    - Final linear layer to num_classes
        pass

    def masked_avg(self, ids):
        """
        Compute average embedding for a batch of sentences, ignoring PAD tokens.
        
        Args:
            ids: LongTensor of shape (B, T)
        Returns:
            Tensor of shape (B, D) - averaged embeddings
        """
        # TODO: Write your implementation here.
        # 1. Look up embeddings: emb = self.embedding(ids)  -> (B, T, D)
        # 2. Create mask for non-PAD tokens: mask = (ids != PAD)
        # 3. Zero out PAD embeddings
        # 4. Sum embeddings and divide by number of non-PAD tokens
        pass

    def forward(self, ids):
        """
        Forward pass for single-sentence classification.
        
        Args:
            ids: LongTensor of shape (B, T) - sentence token IDs
        Returns:
            logits: Tensor of shape (B, num_classes)
        """
        # TODO: Write your implementation here.
        # 1. Get averaged sentence embedding using masked_avg
        # 2. Pass through MLP
        pass

In [None]:
# 1.2.3 Train + Eval (Frozen embeddings)

model_frozen = None  # TODO: define the model
model_frozen = None  # TODO: call train_dan with appropriate parameters
accuracy = None  # TODO: call eval_acc with appropriate parameters

# Q1.2.a
print("test acc with frozen embeddings:", accuracy)

In [None]:
# 1.2.3 Train + Eval (Trainable embeddings)
model_trainable = None  # TODO: define the model
model_trainable = None  # TODO: call train_dan with appropriate parameters
accuracy = None  # TODO: call eval_acc with appropriate parameters

# Q1.2.b
print("test acc with trainable embeddings:", accuracy)

## Question 1.3 - Result interpretation - SST2 [written] (5 points)

Compare the accuracy results across different approaches and explain your observations.

#### <font color="red">Write your answer here.</font>

# Problem 2: Natural Language Inference (30 points)
Natural language inference (NLI) evaluates a model's ability to understand sentence meaning, reasoning, and semantic relationships, making it a core benchmark for assessing natural language understanding. Please refer to [lecture 3](https://uchicago-nlp-course.github.io/winter2026-lectures/lecture-03/#/2/3) for more information.

## Load SNLI dataset

The [SNLI](https://huggingface.co/datasets/stanfordnlp/snli) corpus is a collection of 570k human-written English sentence pairs designed for natural language inference (NLI), also known as textual entailment. If you'd like to explore NLI in more depth, we encourage you to check out [the original SNLI paper](https://arxiv.org/abs/1508.05326) for intuitive explanations and experiments on NLI.

SNLI has sentence pairs `(premise, hypothesis)` with labels:
- `entailment (0)`
- `neutral (1)`
- `contradiction (2)`

We load SNLI with predefined **train**, **validation**, and **test** splits. We then filter out invalid examples by removing rows with missing sentences or unlabeled instances (where `label == -1`). For consistency in preprocessing, we convert both the premise and hypothesis text to lowercase.
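The filtering described above can be sketched on a few hand-written rows. This is an illustrative stand-in, not the actual `load_snli_splits` implementation: `clean_snli_rows` is a hypothetical helper, and the toy rows are invented for this example.

```python
def clean_snli_rows(rows):
    """Drop unlabeled or incomplete rows and lowercase both sentences."""
    cleaned = []
    for row in rows:
        if row["label"] == -1:
            continue  # no gold label was assigned
        if not row["premise"] or not row["hypothesis"]:
            continue  # missing or empty sentence
        cleaned.append({
            "premise": row["premise"].lower(),
            "hypothesis": row["hypothesis"].lower(),
            "label": row["label"],
        })
    return cleaned

toy_rows = [
    {"premise": "A man is Running.", "hypothesis": "A person moves.", "label": 0},
    {"premise": "Two dogs play.", "hypothesis": "The dogs sleep.", "label": -1},
    {"premise": None, "hypothesis": "Nobody is here.", "label": 1},
]

cleaned_rows = clean_snli_rows(toy_rows)
print(cleaned_rows)
```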

In [None]:
# load_snli_splits is imported from classification_util
train_data, val_data, test_data = load_snli_splits()
print(len(train_data), len(val_data), len(test_data))

In [None]:
print(train_data[0]["premise"])
print(train_data[0]["hypothesis"])
print(train_data[0]["label"])

## Question 2.1 - BoW + Logistic Regression on SNLI [code] (10 points)

In this part, you will apply the same method as in Question 1.1 to natural language inference (NLI). This model will serve as a baseline for the NLI task.

### Preprocess
Unlike sentiment analysis, the prediction depends on the relationship between two sentences in natural language inference (NLI), not on either sentence alone. To allow a single text classifier to consider information from both the premise and the hypothesis, we provided a `preprocess_snli` function to concatenate them into one combined input string. The separator token (`[SEP]`) marks the boundary between the two sentences, helping preserve their roles while keeping the model architecture simple.
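A minimal sketch of the concatenation idea is shown below. The helper name `join_pair` and the exact spacing around `[SEP]` are assumptions made for illustration; check `preprocess_snli` in `classification_util.py` for the format actually used.

```python
from collections import Counter

SEP = "[SEP]"

def join_pair(premise, hypothesis, sep=SEP):
    """Concatenate a premise and a hypothesis into one string around a separator."""
    return f"{premise} {sep} {hypothesis}"

combined = join_pair("a man is running.", "a person moves.")
swapped = join_pair("a person moves.", "a man is running.")

print(combined)

# The two strings differ, but their whitespace-token counts are identical.
same_counts = Counter(combined.split()) == Counter(swapped.split())
print(same_counts)
```

One thing to keep in mind when interpreting the baseline: a pure word-count representation discards word order, so the counts alone do not record which side of the separator a word came from.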

In [None]:
# preprocess_snli is imported from classification_util
# It concatenates premise and hypothesis with [SEP] separator
X_train_raw, y_train = preprocess_snli(train_data) 
X_test_raw, y_test = preprocess_snli(test_data)
X_val_raw, y_val = preprocess_snli(val_data)

print(X_train_raw[0])

In [None]:
# Q2.1.bow
print("Vectorizing text...")
# TODO: Write your implementation here.
# Same approach as SST2


# Check shapes (X_train should have 549367 rows)
print(X_train.shape)

In [None]:
print("Training Logistic Regression...")
# TODO: Write your implementation here.
# Same approach as SST2


# Evaluation
# Q2.1
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['entailment', 'neutral', 'contradiction']))

## Question 2.2 - GloVe + MLP classifier for SNLI [code+written] (15 points)

In this part, you will repeat the **averaged word embeddings + MLP (DAN-style)** classification pipeline from the previous section using the same fixed embedding `glove-wiki-gigaword-100`. 

### Model Structure
Unlike SST2, SNLI requires feeding two sentences into the neural network. In this part, our model follows a Siamese-style sentence-pair architecture, where the premise and hypothesis are processed independently using the same encoder. In addition to the first paper (Iyyer et al., 2015), we referenced the model structure of the following paper:
- [Supervised Learning of Universal Sentence Representations from Natural Language Inference Data](https://arxiv.org/abs/1705.02364) (Conneau et al., 2017)

The first part is the same: each sentence is converted into a fixed-length vector by averaging pretrained word embeddings (Iyyer et al., 2015).

Since this is an NLI problem, you will obtain two vectors instead of one:
- $\mathbf{p}$: averaged embedding for the premise
- $\mathbf{h}$: averaged embedding for the hypothesis

After encoding the premise and hypothesis into vectors $\mathbf{p}$ and $\mathbf{h}$, we explicitly model their relationship by constructing a **comparison vector** (Conneau et al., 2017):
$$
\left[ \mathbf{p},\; \mathbf{h},\; |\mathbf{p} - \mathbf{h}|,\; \mathbf{p} \odot \mathbf{h} \right]
$$

Here, the absolute difference $|\mathbf{p} - \mathbf{h}|$ captures how far the two sentences are **semantically apart** in a symmetric way, while the element-wise product $\mathbf{p} \odot \mathbf{h}$ captures **feature overlap and similarity** between them. These comparison features are widely used in natural language inference models because they make it easier for the classifier to reason about sentence relationships.

The combined feature vector is then passed through a **multi-layer perceptron (MLP)**, followed by a **3-way softmax** layer that predicts one of the three NLI labels: **entailment**, **neutral**, or **contradiction**.
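To make the comparison vector concrete, the sketch below builds it for two made-up 2-dimensional sentence vectors using plain Python lists. The helper name `comparison_vector` and the toy values are assumptions for illustration; the model itself should build this vector from tensors.

```python
def comparison_vector(p, h):
    """Concatenate [p, h, |p - h|, p * h] for two equal-length vectors."""
    abs_diff = [abs(a - b) for a, b in zip(p, h)]
    product = [a * b for a, b in zip(p, h)]
    return p + h + abs_diff + product

p = [1.0, 2.0]   # toy premise vector (D = 2)
h = [3.0, -1.0]  # toy hypothesis vector (D = 2)

features = comparison_vector(p, h)
print(features)
print(len(features))  # 4 * D
```

The last two blocks (absolute difference and element-wise product) are unchanged if the premise and hypothesis are swapped, while the first two blocks keep track of which sentence is which.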

### Question 2.2.1: Dynamic padding with a collate function (5 points)

Just like in the SST problem, you will implement a collate function for the sentence-pair SNLI dataset.
The function should dynamically pad premise and hypothesis sequences within each batch and return padded tensors along with the corresponding labels.

In [None]:
def collate_pair_pad(batch):
    """
    Collate function for sentence-pair datasets (SNLI) with dynamic padding.

    Input:
        batch: list of (premise_ids, hypothesis_ids, label) tuples
    Output:
        p_padded: LongTensor of shape (B, Tp_max)
        h_padded: LongTensor of shape (B, Th_max)
        labels: LongTensor of shape (B,)
    """
    # TODO: Implement the collate function for sentence pairs.
    # 1. Separate the batch into premise IDs, hypothesis IDs, and labels
    # 2. Pad premise ID sequences to the max premise length in the batch
    # 3. Pad hypothesis ID sequences to the max hypothesis length in the batch
    # 4. Stack labels into a single tensor
    # 5. Return padded premises, padded hypotheses, and labels

    return p_padded, h_padded, labels


### Testing the collate_pair_pad function

Before using `collate_pair_pad` in the data loaders, let's verify that it works correctly on simple examples. This function pads premises and hypotheses separately, each to the longest sequence of its kind within the batch.

In [None]:
# Q2.2.1
def test_collate_pair_pad():
    """
    Test collate_pair_pad with simple examples.
    
    Expected behavior:
    - Premises should be padded to max premise length in batch
    - Hypotheses should be padded to max hypothesis length in batch
    - Labels should be stacked into a tensor
    - PAD token (0) should be used for padding
    """
    # Create simple test batch with variable-length sequences
    # Format: (premise_ids, hypothesis_ids, label)
    batch = [
        (torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor(0)),        # premise len=3, hyp len=2
        (torch.tensor([6, 7]), torch.tensor([8, 9, 10, 11]), torch.tensor(1)),   # premise len=2, hyp len=4
        (torch.tensor([12]), torch.tensor([13, 14, 15]), torch.tensor(2)),       # premise len=1, hyp len=3
    ]
    
    p_padded, h_padded, labels = collate_pair_pad(batch)
    
    # Check shapes
    assert p_padded.shape == (3, 3), f"Expected premise shape (3, 3), got {p_padded.shape}"
    assert h_padded.shape == (3, 4), f"Expected hypothesis shape (3, 4), got {h_padded.shape}"
    assert labels.shape == (3,), f"Expected labels shape (3,), got {labels.shape}"
    
    # Check padding values (PAD=0)
    # First premise [1,2,3] should have no padding
    assert p_padded[0].tolist() == [1, 2, 3], f"First premise incorrect: {p_padded[0].tolist()}"
    # Second premise [6,7] should be padded to [6,7,0]
    assert p_padded[1].tolist() == [6, 7, 0], f"Second premise incorrect: {p_padded[1].tolist()}"
    # Third premise [12] should be padded to [12,0,0]
    assert p_padded[2].tolist() == [12, 0, 0], f"Third premise incorrect: {p_padded[2].tolist()}"
    
    # Check hypothesis padding
    # First hypothesis [4,5] should be padded to [4,5,0,0]
    assert h_padded[0].tolist() == [4, 5, 0, 0], f"First hypothesis incorrect: {h_padded[0].tolist()}"
    # Second hypothesis [8,9,10,11] should have no padding
    assert h_padded[1].tolist() == [8, 9, 10, 11], f"Second hypothesis incorrect: {h_padded[1].tolist()}"
    # Third hypothesis [13,14,15] should be padded to [13,14,15,0]
    assert h_padded[2].tolist() == [13, 14, 15, 0], f"Third hypothesis incorrect: {h_padded[2].tolist()}"
    
    # Check labels
    assert labels.tolist() == [0, 1, 2], f"Labels incorrect: {labels.tolist()}"
    
    print("All tests passed!")
    print(f"Premise padded shape: {p_padded.shape}")
    print(f"Hypothesis padded shape: {h_padded.shape}")
    print(f"Labels: {labels}")

# Run the test
test_collate_pair_pad()

### SNLIDANClassifier: Extending SSTDANClassifier for sentence pairs (10 points)

For SNLI, we need to process **two sentences** (premise and hypothesis) instead of one. The `SNLIDANClassifier` extends `SSTDANClassifier` by:

1. **Inheriting** the `masked_avg` method from the parent class (no need to reimplement!)
2. **Overriding** `__init__` to change the MLP input dimension to `4 * D` (for the comparison vector)
3. **Overriding** `forward` to:
   - Compute averaged embeddings for both premise ($p$) and hypothesis ($h$)
   - Construct the comparison vector: $[p, h, |p-h|, p \odot h]$
   - Pass through the MLP

In [None]:
class SNLIDANClassifier(SSTDANClassifier):
    """DAN classifier for sentence-pair classification (SNLI).
    
    Inherits masked_avg from SSTDANClassifier.
    You need to implement __init__ and forward.
    """
    
    def __init__(self, embedding_matrix, freeze=True, num_classes=3):
        """
        Args:
            embedding_matrix: numpy array of shape (V+1, D) with pretrained embeddings
            freeze: if True, embeddings are not updated during training
            num_classes: number of output classes (3 for SNLI)
        """
        # NOTE: We call nn.Module.__init__ directly to avoid calling parent's __init__
        # which would create an MLP with wrong input dimension
        nn.Module.__init__(self)
        
        # TODO: Write your implementation here.
        # 1. Create self.embedding using nn.Embedding.from_pretrained (same as SSTDANClassifier)
        # 2. Create self.mlp with input dimension 4*D (for [p, h, |p-h|, p*h])
        #    Same architecture: 3 hidden layers of 256 with ReLU and Dropout(0.2)
        pass

    def forward(self, p_ids, h_ids):
        """
        Forward pass for sentence-pair classification.
        
        Args:
            p_ids: LongTensor of shape (B, Tp) - premise token IDs
            h_ids: LongTensor of shape (B, Th) - hypothesis token IDs
        Returns:
            logits: Tensor of shape (B, num_classes)
        """
        # TODO: Write your implementation here.
        # 1. Get averaged embeddings for premise (p) and hypothesis (h) using self.masked_avg
        # 2. Construct comparison vector: [p, h, |p-h|, p*h]
        # 3. Pass through MLP
        pass

### Training a DAN-style MLP classifier on the SNLI averaged-embedding features
In this part, you will repeat the **averaged word embeddings + MLP (DAN-style)** model structure from the previous section. 
You **do not** need to implement tokenization, dataset classes, or data loaders for this question.  
These components are provided and are conceptually the same as in the SST experiments, extended to sentence pairs. For convenience, tokenization is handled by helper functions from `classification_util.py`.

For this part, you will use the `SNLIDANClassifier` and the shared `train_dan` function with `snli_mode=True`. Since the training takes a lot of time, you will only need to run 3 epochs for the trainable version.

**Note**:

Training for this part may take up to 20 minutes on CPU.

In [None]:
train_ds = SNLIPairIdDataset(train_data, word2idx, tokens_to_ids, max_len=50)
val_ds   = SNLIPairIdDataset(val_data,   word2idx, tokens_to_ids, max_len=50)
test_ds  = SNLIPairIdDataset(test_data,  word2idx, tokens_to_ids, max_len=50)

train_loader = DataLoader(train_ds, batch_size=256, shuffle=True,  collate_fn=collate_pair_pad)
val_loader   = DataLoader(val_ds,   batch_size=512, shuffle=False, collate_fn=collate_pair_pad)
test_loader  = DataLoader(test_ds,  batch_size=512, shuffle=False, collate_fn=collate_pair_pad)

In [None]:
# 2.2.1 Train + Eval (Freeze embeddings)
model_frozen = None  # TODO: define the model
model_frozen = None  # TODO: call train_dan with appropriate parameters
accuracy = None  # TODO: call eval_acc with appropriate parameters

# Q2.2.a
print("test acc with frozen embeddings:", accuracy)

In [None]:
# 2.2.2 Train + Eval (Trainable embeddings)
model_trainable = None  # TODO: define the model
model_trainable = None  # TODO: call train_dan with appropriate parameters
accuracy = None  # TODO: call eval_acc with appropriate parameters

# Q2.2.b
print("test acc with trainable embeddings:", accuracy)

## Question 2.3 - Result interpretation [written] (5 points)

Which representation performed better (BoW vs. pretrained GloVe)? Why might pretrained GloVe outperform count-based representations?

#### <font color="red">Write your answer here.</font>

# Problem 3: Word2Vec (20 points)

In this problem, you will study how the performance of Word2Vec embeddings depends on training data size and embedding dimensionality.

You will train multiple Word2Vec models using the **skip-gram architecture**, varying only:
- the number of training sentences, and
- the dimensionality of the word vectors.

All other hyperparameters should be kept fixed across models.  
This experimental setup follows the analysis presented in the [original word2vec paper](https://arxiv.org/abs/1301.3781) (Mikolov et al., 2013), where different training settings are compared using word analogy evaluations (see the table below).

![CBOW table](CBOW_table.png)

To evaluate embedding quality, you will use the Google Semantic-Syntactic Word Analogy dataset.
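A word analogy question asks "a is to b as c is to ?", and a common way to answer it with embeddings is vector arithmetic followed by a nearest-neighbor search. The sketch below shows the idea on made-up 2-dimensional vectors. It is a simplified illustration (raw vectors, a five-word vocabulary, hypothetical helper names) and not necessarily how the provided evaluation helper scores analogies.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' by finding the word closest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three question words are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy vectors invented for this example.
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [-1.0, 2.0],
}

answer = solve_analogy("man", "woman", "king", toy_vectors)
print(answer)
```

Analogy accuracy is then the fraction of questions for which the predicted word matches the expected one.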

You will train a total of **9 models**. Results should be reported in a table following the format of Table 2, with rows corresponding to embedding dimensionality and columns corresponding to training data size.

### Preparing sentences for Word2Vec
We will train a word2vec model on the [Simple Wikipedia](https://huggingface.co/datasets/rahular/simple-wikipedia) dataset. This corpus contains 770k rows from a text-only dump of the Simple Wikipedia (English).

In this section, we will convert the raw Simple Wikipedia text into a **list of tokenized sentences**, which is the expected input format for training Word2Vec.
The output is a Python list where each element is one sentence represented as a list of tokens, e.g.
`[['the', 'cat', 'sat', 'down'], ['it', 'was', 'tired'], ...]`. 

To do so, we will use `nltk`, a popular Python library for text processing. Read more [here](https://www.nltk.org/).

We provide the function `read_and_sentencize_corpus` below, which you can later use to read and tokenize the Simple Wikipedia dataset.

In [None]:
def read_and_sentencize_corpus(data):
    """ Read rows from the Simple Wikipedia dataset and split them into tokenized sentences.

        Args:
            data: dataset (or dict) whose "text" field holds the raw text of each row

        Return:
            list of lists, with tokenized sentences from each of the processed rows
    """
    tokenized_sentences = []
    texts = data["text"]
    for text in tqdm(texts):
        # Split each row of raw text into sentences
        sentences = sent_tokenize(text)
        for sentence in sentences:
            # Lowercase and tokenize each sentence
            tokenized_sentences.append(word_tokenize(sentence.lower()))
    return tokenized_sentences
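
To make the expected output format concrete, here is a minimal sketch that mimics the function above on a toy input. It is only an approximation: `toy_sentencize` is a hypothetical helper that uses regular expressions instead of `nltk`, so it drops punctuation, whereas `word_tokenize` keeps punctuation marks as separate tokens.

```python
import re

def toy_sentencize(texts):
    """Rough stand-in for read_and_sentencize_corpus (regex instead of nltk)."""
    tokenized_sentences = []
    for text in texts:
        # Naive sentence split: break on whitespace that follows ., ! or ?
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            # Naive word split: lowercase runs of letters, digits, and apostrophes
            tokens = re.findall(r"[a-z0-9']+", sentence.lower())
            if tokens:
                tokenized_sentences.append(tokens)
    return tokenized_sentences

toy = toy_sentencize(["The cat sat down. It was tired."])
print(toy)

# Total number of tokens across all sentences
n_tokens = sum(len(sentence) for sentence in toy)
print(n_tokens)
```

Each inner list is one sentence, which is the shape Word2Vec training expects.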

Next, we provide a function that trains a word2vec model on our sentences. word2vec embeddings convert words into dense vector representations that capture semantic relationships between them. Training the model involves analyzing the context in which words appear within sentences to learn these meaningful vector representations.

This function trains a Word2Vec model with the following settings:
- the training sentences provided via the input argument `sentences`
- embedding dimensionality set by the input argument `dim`
- context window size of 5
- minimum word count of 5
- skip-gram training algorithm

All other parameters should be left at their default values.

This function will be reused in later parts of the assignment to train **9 different Word2Vec models**, where both the **training corpus size (and resulting vocabulary)** and the **embedding dimensionality** are varied. Keeping the training logic encapsulated in a single function ensures consistency across experimental settings.

In [None]:
def train_word2vec(sentences, dim):
    """
    Inputs:
        sentences (List[List[str]]):
            Subset of the tokenized training corpus. 

        dim (int):
            Dimensionality of the learned word vectors.

    Returns:
        word_vectors (gensim.models.KeyedVectors):
            Trained word vectors learned by Word2Vec. 
    """
    model = Word2Vec(
        sentences=sentences,
        vector_size=dim,
        min_count=5,
        window=5,
        sg=1,  # Skip-gram
        workers=4,
    )
    word_vectors = model.wv

    return word_vectors
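
For intuition about the skip-gram setting and the context window used above, the sketch below lists the (center word, context word) pairs that a fixed window would produce for one sentence. The helper `skipgram_pairs` is hypothetical and deliberately simplified; library implementations typically add refinements that are omitted here.

```python
def skipgram_pairs(sentence, window):
    """Return (center, context) pairs within `window` positions of each word."""
    pairs = []
    for i, center in enumerate(sentence):
        # Clamp the window to the sentence boundaries
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs

# window=1 keeps the output short; the assignment uses window=5
pairs = skipgram_pairs(["the", "cat", "sat", "down"], window=1)
print(pairs)
```

Skip-gram training then learns vectors that make each center word predictive of its context words.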

## Question 3.1 - Train Word2Vec Scaling Grid [code] (10 points)

In this question, you will implement a function that trains a **grid of Word2Vec models** across multiple training data sizes and embedding dimensionalities.

You will evaluate the following parameter grid:
- **Training sizes**: 100k, 200k, 400k sentences
- **Embedding dimensions**: 50, 100, 300

The goal is to systematically study how Word2Vec performance scales as:
- the number of training sentences increases, and
- the dimensionality of the word embeddings increases.

The function `train_scaling_grid` should iterate over all combinations of the provided training sizes (`train_sizes`) and embedding dimensions (`embed_dims`). For each combination, it should:
1. load a subset of the training corpus containing the specified number of sentences
2. convert the raw text into tokenized sentences suitable for Word2Vec training
3. train a Word2Vec model using the `train_word2vec` function provided above, and
4. store the trained word vectors in a dictionary indexed by `(number_of_sentences, embedding_dimension)`.

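The function should return a dictionary mapping each `(n_sentences, dim)` pair to its corresponding trained Word2Vec word vectors, together with a `token_counts` dictionary that records the number of tokens in the training corpus for each training size.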
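
As a quick check on the size of the experiment, the sketch below enumerates the tuple keys of the grid without training anything. The name `grid` and the `None` placeholders are illustrative assumptions; in the assignment, the values would be the trained word vectors.

```python
from itertools import product

train_sizes = [100000, 200000, 400000]
embed_dims = [50, 100, 300]

# One key per (n_sentences, dim) setting; values are placeholders here
grid = {(n, dim): None for n, dim in product(train_sizes, embed_dims)}

print(len(grid))               # number of settings in the grid
print((200000, 100) in grid)   # tuple keys can be looked up directly
```

Three training sizes times three dimensions gives the nine models mentioned earlier.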

**Hints**:
Use the provided `read_and_sentencize_corpus` to convert raw text into a list of tokenized sentences before training Word2Vec.

**Note**:
Training all nine models may take around 20 minutes.

In [None]:
# Q3.1
train_sizes = [100000, 200000, 400000]
embed_dims  = [50, 100, 300]

max_n = max(train_sizes)
# Load the first 400000 rows of Simple Wikipedia (each row may contain several sentences)
ds = load_dataset("rahular/simple-wikipedia", split=f"train[:{max_n}]")
all_sents = read_and_sentencize_corpus(ds)

def train_scaling_grid(train_sizes, embed_dims, base_seed=0):
    """
    Inputs:
        train_sizes (List[int]): 
            List of training corpus sizes, where each value specifies the number
            of sentences to use for training a Word2Vec model.

        embed_dims (List[int]): 
            List of embedding dimensionalities to use when training Word2Vec models.

        base_seed (int): 
            Base random seed for reproducibility across different training runs.

    Returns:
        models (dict): 
            A dictionary mapping (n_sentences, dim) -> trained Word2Vec word vectors.
            Each entry corresponds to a Word2Vec model trained using `n_sentences`
            sentences and embedding dimensionality `dim`.
        token_counts (dict):
            A dictionary recording the number of words (tokens) in the training corpus for each training size.
    """
    models = {}
    token_counts = {}
    # TODO: Write your implementation here.
    # For each training size:
    #   1. Take the first n sentences from all_sents
    #   2. Note: all_sents was already tokenized with read_and_sentencize_corpus above
    #   3. Count tokens
    #   4. For each embedding dimension, train Word2Vec model
    #   5. Store in models dict with key (n, dim)

    return models, token_counts

# Train models (this can take time; you can reduce train_sizes or epochs for debugging)
models, word_counts = train_scaling_grid(train_sizes, embed_dims, base_seed=42)

print("\nTrained models:", len(models))

## Question 3.2 - Word Analogy Evaluation [written] (5 points)

To evaluate the quality of the trained Word2Vec embeddings, we use the **Google Semantic-Syntactic Word Analogy dataset**, introduced by Mikolov et al. (2013).

This dataset consists of analogy questions of the form:
$$
a : b :: c : d
$$
which can be read as *"a is to b as c is to d."*  
Each question tests whether word vectors encode linguistic regularities that can be recovered through vector arithmetic.

To answer an analogy question $(a, b, c, d)$, the model computes the vector:
$$
\mathbf{v}(b) - \mathbf{v}(a) + \mathbf{v}(c)
$$
and checks whether the word whose embedding is **closest in cosine similarity** to this result is the correct target word $d$.
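
The arithmetic above can be illustrated with a toy example in plain Python. The two-dimensional vectors and the helper names `cosine` and `solve_analogy` are made up for illustration, and excluding the three query words from the candidates is an assumed convention; this document does not state how `utils.py` handles that detail.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def solve_analogy(vectors, a, b, c):
    """Return the word closest (by cosine) to v(b) - v(a) + v(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Assumption: the query words themselves are not valid answers
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-made toy vectors, not real embeddings
toy_vectors = {
    "man": [1.0, 0.0], "woman": [1.0, 1.0],
    "king": [3.0, 0.0], "queen": [3.0, 1.0],
    "apple": [0.0, 2.0],
}
prediction = solve_analogy(toy_vectors, "man", "woman", "king")
print(prediction)
```

An analogy question counts as correct when the predicted word equals the target word $d$.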

All evaluation utilities have already been implemented for you in `utils.py`. In particular, the provided functions

- `download_analogy_dataset`
- `load_analogy_dataset`
- `evaluate_analogies_gensim`
- `summarize_results`

handle dataset loading, analogy evaluation, and result aggregation. You do not need to write any additional code in this section.

The code below evaluates each trained Word2Vec model in `models` and computes analogy accuracy across different analogy categories. The results are then summarized for each `(number_of_sentences, embedding_dimension)` setting.

Based on the printed outputs and summarized results, write a short analysis of this experiment. What results do you find interesting?

In [None]:
# 1) Load analogy dataset once
path = download_analogy_dataset()              # downloads only if missing
analogy_data = load_analogy_dataset(path)
print(f"Loaded {len(analogy_data)} analogy categories")

# Show total questions
total = sum(len(q) for q in analogy_data.values())
print(f"Total questions: {total}")

In [None]:
# 2) Evaluate each trained model in `models`
# `models` maps: (n_sentences, dim) -> trained KeyedVectors (wv)
results = {}
for key, wv in models.items():
    # evaluate_analogies_gensim returns per-category stats (e.g., correct/total)
    results[key] = evaluate_analogies_gensim(wv, analogy_data)

In [None]:
# 3) Summarize per-category results into one overall accuracy per model
all_results = {}
for (n_sent, dim), per_cat_results in results.items():
    name = f"n={n_sent}_dim={dim}"
    # summarize_results returns a dict with 'overall_accuracy', totals, and per-category breakdown
    all_results[(n_sent, dim)] = summarize_results(
        per_cat_results,
        name=name
    )

In [None]:
# 4) Convert summarized results into a DataFrame for easy pivoting
rows = []
for (n_sent, dim), s in all_results.items():
    rows.append({
        "dim": dim,
        "n_sentences": n_sent,
        "accuracy": 100 * s["overall_accuracy"],  
    })
df = pd.DataFrame(rows)
df["n_words"] = df["n_sentences"].map(word_counts)

# Q3.2
df

#### <font color="red">Write your answer here.</font>

## Question 3.3 - Visualization of Word2Vec Scaling Behavior [code] (5 points)

In this question, you will visualize how Word2Vec analogy accuracy changes as a function of training data size and embedding dimensionality.

Using the summarized results from the previous section, create a line plot where:
- the x-axis represents the number of words (not sentences),
- the y-axis represents analogy accuracy (in percentage),
- each line corresponds to a different embedding dimensionality.

This plot provides a visual summary of the scaling behavior observed in the Word2Vec experiments and complements the table reported earlier.

In [None]:
plt.figure(figsize=(7, 5))
# TODO: Write your implementation here.
# Create a line plot showing:
# - x-axis: number of training words
# - y-axis: analogy accuracy (%)
# - separate line for each embedding dimension


plt.xlabel("Number of training words")
plt.ylabel("Analogy accuracy (%)")
plt.title("Word2Vec Analogy Accuracy vs Training Size")
plt.legend(title="Embedding dimension")
plt.grid(True)
plt.tight_layout()
plt.show()