In [None]:
%pip install datasets
%pip install transformers

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [2]:
from datasets import load_dataset

dataset = load_dataset("commonsense_qa")

print(dataset)




DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'question_concept', 'choices', 'answerKey'],
        num_rows: 9741
    })
    validation: Dataset({
        features: ['id', 'question', 'question_concept', 'choices', 'answerKey'],
        num_rows: 1221
    })
    test: Dataset({
        features: ['id', 'question', 'question_concept', 'choices', 'answerKey'],
        num_rows: 1140
    })
})


In [3]:
sample = dataset["train"][0]
print(sample)

{'id': '075e483d21c29a511267ef62bedc0461', 'question': 'The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?', 'question_concept': 'punishing', 'choices': {'label': ['A', 'B', 'C', 'D', 'E'], 'text': ['ignore', 'enforce', 'authoritarian', 'yell at', 'avoid']}, 'answerKey': 'A'}


In [4]:
sample = dataset["validation"][0]
print(sample)

{'id': '1afa02df02c908a558b4036e80242fac', 'question': 'A revolving door is convenient for two direction travel, but it also serves as a security measure at a what?', 'question_concept': 'revolving door', 'choices': {'label': ['A', 'B', 'C', 'D', 'E'], 'text': ['bank', 'library', 'department store', 'mall', 'new york']}, 'answerKey': 'A'}


In [5]:
sample = dataset["test"][0]
print(sample)

{'id': '90b30172e645ff91f7171a048582eb8b', 'question': 'The townhouse was a hard sell for the realtor, it was right next to a high rise what?', 'question_concept': 'townhouse', 'choices': {'label': ['A', 'B', 'C', 'D', 'E'], 'text': ['suburban development', 'apartment building', 'bus stop', 'michigan', 'suburbs']}, 'answerKey': ''}


In [6]:
import pandas as pd

df_train = pd.DataFrame(dataset["train"])
df_val = pd.DataFrame(dataset["validation"])
df_test = pd.DataFrame(dataset["test"])

print(f"Train Size: {len(df_train)}")
print(f"Validation Size: {len(df_val)}")
print(f"Test Size: {len(df_test)}")

print(df_train.head())


Train Size: 9741
Validation Size: 1221
Test Size: 1140
                                 id  \
0  075e483d21c29a511267ef62bedc0461   
1  61fe6e879ff18686d7552425a36344c8   
2  4c1cb0e95b99f72d55c068ba0255c54d   
3  02e821a3e53cb320790950aab4489e85   
4  23505889b94e880c3e89cff4ba119860   

                                            question question_concept  \
0  The sanctions against the school were a punish...        punishing   
1  Sammy wanted to go to where the people were.  ...           people   
2  To locate a choker not located in a jewelry bo...           choker   
3  Google Maps and other highway and street GPS s...          highway   
4  The fox walked from the city into the forest, ...              fox   

                                             choices answerKey  
0  {'label': ['A', 'B', 'C', 'D', 'E'], 'text': [...         A  
1  {'label': ['A', 'B', 'C', 'D', 'E'], 'text': [...         B  
2  {'label': ['A', 'B', 'C', 'D', 'E'], 'text': [...         A  
3  {'label'

In [None]:
**Final Submission – Data Processing and Model Training**

## **1. Introduction**

This notebook serves as the final submission for the project on **Word Embeddings & RNNs** using the **CommonsenseQA dataset**. It includes:

- Data loading
- Data preprocessing (tokenization, cleaning, padding)
- Integration of Word Embeddings (word2vec)
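- Model definitions: a two-layer feedforward classifier and an LSTM classifier
- Loss function and optimizer setup
- Training loop with early stopping, a hyperparameter sweep, and final evaluation
- Experiment tracking with Weights & Biases (WandB)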

## **2. Setup & Installations**

**Rationale:**

- `torch`: Required for model training and tensor processing.
- `datasets`: Used to load the CommonsenseQA dataset from Hugging Face.
- `nltk`: Employed for tokenization and text preprocessing.
- `gensim`: Used for loading pre-trained word embeddings (word2vec).
- `torch.nn`: Required to define the model architecture.
- `torch.optim`: Used for training the model with gradient updates.
- `wandb`: Used for tracking experiments, logging metrics, and visualizing model performance.
- `argparse`: Enables dynamic configuration of hyperparameters.

```python
# Install required packages
%pip install torch datasets nltk gensim wandb

# Import necessary libraries
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader, Dataset
import nltk
from nltk.tokenize import word_tokenize
import re
from gensim.models import KeyedVectors
import gensim.downloader as api
import torch.nn as nn
import torch.optim as optim
import wandb
import argparse
```

## **3. Data Loading**

**Rationale:**

- Follows the exact project specification from the Course Project.pdf.
- Loads specific dataset splits for training, validation, and test using the TAU version.

```python
from datasets import load_dataset
train = load_dataset("tau/commonsense_qa", split="train[:-1000]")
valid = load_dataset("tau/commonsense_qa", split="train[-1000:]")
test = load_dataset("tau/commonsense_qa", split="validation")

# Check dataset sizes
print(len(train), len(valid), len(test))
```
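
The slice syntax in the `split` argument behaves like ordinary Python slicing over the rows of the original split. As a rough sanity check, the sketch below reproduces the sizes the `print` call should report, using the row counts from the `DatasetDict` printed at the top of the notebook. The helper name `expected_sizes` is made up for illustration.

```python
# Row counts taken from the DatasetDict printed earlier in this notebook.
ORIGINAL_TRAIN_ROWS = 9741
ORIGINAL_VALIDATION_ROWS = 1221

def expected_sizes(n_train, n_validation, holdout=1000):
    """Mimic train[:-holdout], train[-holdout:] and validation with list slicing."""
    rows = list(range(n_train))
    return len(rows[:-holdout]), len(rows[-holdout:]), n_validation

print(expected_sizes(ORIGINAL_TRAIN_ROWS, ORIGINAL_VALIDATION_ROWS))  # (8741, 1000, 1221)
```

The original validation split is used as the test set here because the official test split carries no labels (its `answerKey` is an empty string in the sample printed above).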

## **4. Data Preprocessing**

**Rationale:**

- Prepares raw text inputs for downstream modeling.
- Ensures text normalization through lowercasing and punctuation removal.
- Tokenization helps break input into word-level units for embeddings.
- Padding standardizes input lengths for batch processing.

```python
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look up this resource for word_tokenize

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return word_tokenize(text)

```
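
To see what the normalization does without downloading any NLTK data, here is a minimal sketch that swaps `word_tokenize` for a plain whitespace split. The function name `normalize` is an assumption for this example, and the whitespace split is only a stand-in, so its output may differ from NLTK's on edge cases.

```python
import re

def normalize(text):
    """Lowercase, drop punctuation, and split on whitespace (stand-in for word_tokenize)."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return text.split()

print(normalize("Don't stop, OK?"))  # ['dont', 'stop', 'ok']
```

Note that stripping punctuation before tokenizing merges contractions ("don't" becomes "dont"). Such tokens may not exist in a pre-trained vocabulary and would then fall back to the unknown-word handling.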

## **5. Dataset Class (Shared)**

**Rationale:**

- Provides a PyTorch-compatible wrapper around the CommonsenseQA dataset.
- Centralizes tokenization, preprocessing, and answer label formatting.

```python
class CommonsenseQADataset(Dataset):
    def __init__(self, split="train"):
        if split == "train":
            self.dataset = train
        elif split == "validation":
            self.dataset = valid
        elif split == "test":
            self.dataset = test
        else:
            raise ValueError(f"Invalid split: {split}")
        self.processed_data = self.process_data()

    def process_data(self):
        processed = []
        for item in self.dataset:
            question = preprocess_text(item["question"])
            choices = [preprocess_text(choice) for choice in item["choices"]["text"]]
            answer = ord(item["answerKey"]) - ord('A')
            processed.append((question, choices, answer))
        return processed

    def __len__(self):
        return len(self.processed_data)

    def __getitem__(self, idx):
        return self.processed_data[idx]
```
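
The label conversion `ord(item["answerKey"]) - ord('A')` assumes every item has a one-letter key. The official test split printed earlier has an empty `answerKey`, and `ord('')` raises a `TypeError`. The sketch below shows one possible guard; the helper name `answer_to_index` and the choice to return `None` for unlabeled items are assumptions, not part of the project specification.

```python
def answer_to_index(answer_key, num_choices=5):
    """Map 'A'..'E' to 0..4; return None for unlabeled items (empty key)."""
    if not answer_key:
        return None
    index = ord(answer_key) - ord("A")
    if not 0 <= index < num_choices:
        raise ValueError(f"Unexpected answer key: {answer_key!r}")
    return index

print(answer_to_index("A"), answer_to_index("E"), answer_to_index(""))  # 0 4 None
```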

## **6. Load Word Embeddings (word2vec)**

**Rationale:**

- I use Gensim's pre-trained Google News vectors (300D) for semantic information.
- Pre-trained embeddings reduce the need for large training data.
- Unknown words return a zero vector, which avoids crashes during lookup.

```python
word2vec = api.load("word2vec-google-news-300")

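def get_word_vector(word):
    # gensim returns NumPy arrays; convert them so torch.stack works downstream
    if word in word2vec:
        return torch.from_numpy(word2vec[word].copy())
    return torch.zeros(300)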

print(get_word_vector("cold")[:10])
```
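
Because unknown words silently become zero vectors, it can help to measure how often that fallback fires. The following sketch illustrates the idea with a toy dictionary in place of the real word2vec model. The vectors, the token list, and the helper names `lookup` and `oov_rate` are invented for illustration.

```python
def lookup(word, vectors, dim):
    """Return the stored vector, or a zero vector for out-of-vocabulary words."""
    return vectors.get(word, [0.0] * dim)

def oov_rate(tokens, vectors):
    """Fraction of tokens that would fall back to the zero vector."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in vectors)
    return missing / len(tokens)

# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {"cold": [0.1, 0.2, 0.3], "door": [0.4, 0.5, 0.6]}
tokens = ["cold", "revolving", "door", "dont"]

print(lookup("revolving", toy_vectors, 3))  # [0.0, 0.0, 0.0]
print(oov_rate(tokens, toy_vectors))        # 0.5
```

A high out-of-vocabulary rate would suggest revisiting the preprocessing, for example the punctuation stripping.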

## **7. Feedforward Data Pipeline**

**Rationale:**

- Converts token sequences into averaged embeddings.
- Simplifies the input into fixed-size vectors.
- Suitable for feedforward networks requiring consistent input shape.

```python
def embed_sequence(tokens):
    vectors = [get_word_vector(token) for token in tokens]
    if len(vectors) == 0:
        return torch.zeros(300)
    return torch.stack(vectors).mean(dim=0)

def collate_fn_ffnn(batch):
    questions, choices, answers = zip(*batch)
    embedded_questions = torch.stack([embed_sequence(q) for q in questions])
    embedded_choices = torch.stack([
        torch.stack([embed_sequence(choice) for choice in choice_list])
        for choice_list in choices
    ])
    return embedded_questions, embedded_choices, torch.tensor(answers)
```
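
One consequence of averaging is that word order is discarded. The small sketch below uses made-up 2-dimensional vectors (not real word2vec values) to show two sentences with different meanings collapsing to the same fixed-size vector.

```python
def mean_vector(vectors, dim):
    """Element-wise mean of equal-length vectors; zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Toy embeddings, invented for illustration only.
toy = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

a = mean_vector([toy[w] for w in ["dog", "bites", "man"]], 2)
b = mean_vector([toy[w] for w in ["man", "bites", "dog"]], 2)
print(a == b)  # True
```

This is the main limitation the sequence-based pipeline is meant to address.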

## **8. LSTM Data Pipeline**

**Rationale:**

- I use full padded sequences of word embeddings for input.
- Prepares input suitable for recurrent models (e.g., LSTM).

```python
def embed_tokens(tokens):
    return torch.stack([get_word_vector(token) for token in tokens]) if tokens else torch.zeros((1, 300))

def collate_fn_lstm(batch):
    questions, choices, answers = zip(*batch)
    embedded_questions = torch.nn.utils.rnn.pad_sequence(
        [embed_tokens(q) for q in questions], batch_first=True
    )
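    # Pad every choice in the batch to one shared length so the result is a single tensor
    # (assumes each question has the same number of choices, five in CommonsenseQA)
    flat_choices = [embed_tokens(choice) for choice_list in choices for choice in choice_list]
    padded_choices = torch.nn.utils.rnn.pad_sequence(flat_choices, batch_first=True)
    embedded_choices = padded_choices.view(
        len(choices), -1, padded_choices.size(1), padded_choices.size(2)
    )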
    return embedded_questions, embedded_choices, torch.tensor(answers)
```
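
Padding itself is simple: every sequence is extended with a filler value up to the length of the longest one in the batch. The sketch below shows this on plain lists of token ids. `pad_sequences` is a hypothetical helper for illustration; the pipeline above relies on `torch.nn.utils.rnn.pad_sequence`, which does the equivalent on tensors.

```python
def pad_sequences(sequences, pad_value=0):
    """Right-pad each sequence to the longest length and return the original lengths."""
    max_len = max((len(seq) for seq in sequences), default=0)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths

padded, lengths = pad_sequences([[1, 2, 3], [4], [5, 6]])
print(padded)   # [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
print(lengths)  # [3, 1, 2]
```

Keeping the original lengths is useful because a recurrent model that reads only the last time step of a padded batch sees padding rather than the final real token for the shorter sequences.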

## **9. Initialize WandB Tracking**

**Rationale:**

- WandB helps track hyperparameters, losses, and accuracy during training.
- Enables experiment comparison and makes training behavior reproducible.
- The architecture type ("ffnn" or "lstm") is also logged in the configuration to support automatic routing in data loading and model selection.

```python
wandb.init(project="commonsense_qa", name="ffnn_baseline")
wandb.config.update({
    "architecture": "ffnn",
    "input_dim": 300,
    "hidden_dim": 128,
    "output_dim": 5,
    "learning_rate": 0.001,
    "batch_size": 16
})
```

## **10. Define get_dataloaders Function**

**Rationale:**

- Automatically chooses the correct collate function based on the model architecture defined in the WandB configuration.
- Avoids manual mode switching or hardcoding.
- Enables consistent data access for both training and evaluation loops.

```python
def get_dataloaders(batch_size=None, mode=None):
    batch_size = batch_size or wandb.config["batch_size"]
    # An explicit mode overrides the architecture stored in the WandB config
    mode = mode or wandb.config.get("architecture", "ffnn")
    collate = collate_fn_ffnn if mode == "ffnn" else collate_fn_lstm

    train_dataset = CommonsenseQADataset("train")
    val_dataset = CommonsenseQADataset("validation")

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate)

    return train_loader, val_loader
```
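
A two-way `if`/`else` treats every value other than `"ffnn"` as `"lstm"`, so a typo in the config would go unnoticed. One possible alternative is a small lookup table that fails loudly, sketched below. The stub functions and the names `COLLATE_REGISTRY` and `select_collate` are placeholders for this example, not part of the notebook's pipeline.

```python
def collate_ffnn_stub(batch):
    """Placeholder standing in for the FFNN collate function."""
    return ("ffnn", batch)

def collate_lstm_stub(batch):
    """Placeholder standing in for the LSTM collate function."""
    return ("lstm", batch)

COLLATE_REGISTRY = {"ffnn": collate_ffnn_stub, "lstm": collate_lstm_stub}

def select_collate(architecture):
    """Return the collate function for an architecture name, or raise on unknown names."""
    try:
        return COLLATE_REGISTRY[architecture]
    except KeyError:
        raise ValueError(f"Unknown architecture: {architecture!r}") from None

print(select_collate("lstm")([1, 2]))  # ('lstm', [1, 2])
```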

## **11. Pipeline Test – FFNN**

**Rationale:**

- Ensures the FFNN collate function works and produces expected shapes.

```python
train_loader, val_loader = get_dataloaders(batch_size=wandb.config["batch_size"], mode="ffnn")
for batch in train_loader:
    questions, choices, labels = batch
    print("[FFNN] Sample question shape:", questions.shape)
    print("[FFNN] Sample choices shape:", choices.shape)
    print("[FFNN] Label:", labels[0])
    break
```

## **12. Pipeline Test – LSTM**

**Rationale:**

- Verifies padding and batching for the LSTM collate function.

```python
train_loader, val_loader = get_dataloaders(batch_size=wandb.config["batch_size"], mode="lstm")
for batch in train_loader:
    questions, choices, labels = batch
    print("[LSTM] Sample question shape:", questions.shape)
    print("[LSTM] Sample choices shape:", choices.shape)
    print("[LSTM] Label:", labels[0])
    break


for batch in train_loader:
    questions, choices, labels = batch
    print("Sample padded question:", questions[0])
    print("Sample padded choices:", choices[0])
    print("Sample label:", labels[0])
    break
```

## **13. Model Architecture – Two-Layer Feedforward Classifier**

**Rationale:**

- This architecture is based on the first model specified in the project description.
- I use a simple two-layer fully connected network to classify questions using pre-trained word embeddings.
- A ReLU activation function introduces non-linearity, which improves the model’s ability to learn complex patterns.
- This simpler model serves as a baseline before introducing recurrent components (e.g., LSTM/GRU) in future steps.

```python
class FeedforwardClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedforwardClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

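    def forward(self, x, choices=None):
        # `choices` is accepted so the model can be called as model(questions, choices)
        # in train_model; this baseline predicts from the question vector alone.
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x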

INPUT_DIM = 300
HIDDEN_DIM = 128
OUTPUT_DIM = 5
model = FeedforwardClassifier(INPUT_DIM, HIDDEN_DIM, OUTPUT_DIM)
```
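
For a sense of scale, the number of trainable parameters follows directly from the layer sizes: each linear layer contributes `in_features * out_features` weights plus `out_features` biases. A quick calculation with the dimensions used above (the helper names are made up for this sketch):

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

def ffnn_params(input_dim, hidden_dim, output_dim):
    """Total parameters of a two-layer feedforward network."""
    return linear_params(input_dim, hidden_dim) + linear_params(hidden_dim, output_dim)

print(ffnn_params(300, 128, 5))  # 39173
```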

## **14. Model Architecture – LSTM Classifier**

**Rationale:**

- This model extends the input representation by modeling temporal relationships between tokens.
- Instead of averaging embeddings, it uses an LSTM layer to process the concatenated question and choice embeddings.
- For each choice, the LSTM receives the question and choice embeddings concatenated together, and outputs a score for that choice.
- The model returns logits over all five choices, which are passed into the loss function for training.

```python
class LSTMClassifier(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, output_dim):
        super(LSTMClassifier, self).__init__()
        # output_dim (the number of choices) is kept for a uniform constructor signature;
        # each choice receives one score, so the linear head maps hidden_dim -> 1.
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, question_seq, choice_seqs):
        # question_seq: (batch, q_len, emb); choice_seqs: (batch, n_choices, c_len, emb)
        outputs = []
        for i in range(choice_seqs.size(1)):
            # Concatenate question and choice tokens along the time axis
            choice_input = torch.cat((question_seq, choice_seqs[:, i]), dim=1)
            lstm_out, _ = self.lstm(choice_input)
            outputs.append(self.fc(lstm_out[:, -1, :]))  # (batch, 1)

        logits = torch.cat(outputs, dim=1)  # (batch, n_choices)
        return logits
```

## **15. Define Loss Function & Optimizer**

**Rationale:**

- `CrossEntropyLoss` is standard for multi-class classification tasks like CommonsenseQA.
- `Adam` optimizer adapts learning rates and usually performs well with minimal tuning.
- The learning rate is stored in WandB config to support reproducible experimentation.

```python
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=wandb.config["learning_rate"])
```
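
To make the loss values easier to read, it helps to know what cross-entropy computes for a single example: the negative log of the softmax probability assigned to the correct choice. The sketch below is a plain-Python version for one example (not the PyTorch implementation, which also averages over the batch by default). It shows that a model with no preference among five choices starts at a loss of ln 5.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-softmax of the target logit, computed in a numerically stable way."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]

uniform_loss = cross_entropy([0.0] * 5, target_index=2)
confident_loss = cross_entropy([0.0, 0.0, 5.0, 0.0, 0.0], target_index=2)
print(round(uniform_loss, 4))  # 1.6094
print(confident_loss < uniform_loss)  # True
```

A training loss that stays near 1.61 therefore suggests the model is not doing better than guessing.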

## **16. Training Loop**

**Rationale:**

- This loop trains either the FFNN or LSTM model depending on the current WandB config.
- Tracks training loss and accuracy over epochs.
- Logs results to WandB for performance monitoring.
- Implements Early Stopping to prevent overfitting when validation performance stagnates.
- Saves two model versions: the one with best validation accuracy, and the last seen model in case of interruption.

```python
def train_model(model, train_loader, val_loader, loss_fn, optimizer, epochs=10, patience=3):
    best_val_acc = 0
    patience_counter = 0

    for epoch in range(epochs):
        model.train()
        total_loss, correct, total = 0, 0, 0
        for questions, choices, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(questions, choices)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            preds = torch.argmax(outputs, dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)

        train_acc = correct / total
        wandb.log({"train_loss": total_loss / len(train_loader), "train_acc": train_acc})

        # Evaluation
        model.eval()
        val_correct, val_total = 0, 0
        with torch.no_grad():
            for questions, choices, labels in val_loader:
                outputs = model(questions, choices)
                preds = torch.argmax(outputs, dim=1)
                val_correct += (preds == labels).sum().item()
                val_total += labels.size(0)

        val_acc = val_correct / val_total
        wandb.log({"val_acc": val_acc})

        print(f"Epoch {epoch+1}, Train Acc: {train_acc:.4f}, Val Acc: {val_acc:.4f}")

        # Early Stopping
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            patience_counter = 0
            torch.save(model.state_dict(), "best_model.pt")
            artifact = wandb.Artifact("best-model", type="model")
            artifact.add_file("best_model.pt")
            wandb.log_artifact(artifact)
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print("Early stopping triggered.")
                break

        # Save last seen model every epoch
        torch.save(model.state_dict(), "last_model.pt")
        wandb.save("last_model.pt")
```
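
The patience rule can be traced without training anything. The sketch below replays the same comparison logic over a list of validation accuracies. The accuracy values are invented for illustration, and `early_stopping_epoch` is a hypothetical helper, not part of the training code.

```python
def early_stopping_epoch(val_accuracies, patience=3):
    """Return (stop_epoch, best_epoch), both 1-based, under the patience rule."""
    best_acc, best_epoch, counter = 0.0, 0, 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            best_acc, best_epoch, counter = acc, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return epoch, best_epoch
    return len(val_accuracies), best_epoch

# Made-up validation accuracies for six epochs.
history = [0.20, 0.25, 0.24, 0.25, 0.23, 0.30]
print(early_stopping_epoch(history, patience=3))  # (5, 2)
print(early_stopping_epoch(history, patience=4))  # (6, 6)
```

With `patience=3` training stops at epoch 5 and never reaches the improvement at epoch 6. Ties do not reset the counter because the comparison is strict, so the patience value trades compute against the risk of stopping too early.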

## **17. Hyperparameter Sweep Configuration**

**Rationale:**

- Hyperparameter sweeps allow automated exploration of multiple configurations.
- I use a `random` search strategy to test combinations of architecture, hidden size, and learning rate without enumerating the full grid.

```python
sweep_config = {
    "method": "random",
    "metric": {"name": "val_acc", "goal": "maximize"},
    "parameters": {
        "architecture": {"values": ["ffnn", "lstm"]},
        "hidden_dim": {"values": [64, 128, 256]},
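        "learning_rate": {"min": 1e-4, "max": 5e-3},
        # Fixed values that sweep_train and get_dataloaders read from wandb.config
        "input_dim": {"value": 300},
        "output_dim": {"value": 5},
        "batch_size": {"value": 16}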
    }
}

sweep_id = wandb.sweep(sweep_config, project="commonsense_qa")

# Training entry point for the sweep

def sweep_train():
    wandb.init()
    config = wandb.config

    model_class = FeedforwardClassifier if config.architecture == "ffnn" else LSTMClassifier
    train_loader, val_loader = get_dataloaders()

    model = model_class(config.input_dim, config.hidden_dim, config.output_dim)
    loss_function = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)

    train_model(model, train_loader, val_loader, loss_function, optimizer)

# To run the sweep:
# wandb.agent(sweep_id, function=sweep_train, count=10)
```
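
Conceptually, random search draws each run's configuration independently from the search space. The sketch below imitates that with Python's `random` module. It is a local illustration only, not how the WandB sweep server is implemented, and it assumes the learning rate is drawn uniformly between the stated bounds.

```python
import random

# Same search space as the sweep configuration above.
SEARCH_SPACE = {
    "architecture": ["ffnn", "lstm"],
    "hidden_dim": [64, 128, 256],
    "learning_rate": (1e-4, 5e-3),
}

def sample_config(rng):
    """Draw one configuration: categorical choices plus a uniform learning rate."""
    low, high = SEARCH_SPACE["learning_rate"]
    return {
        "architecture": rng.choice(SEARCH_SPACE["architecture"]),
        "hidden_dim": rng.choice(SEARCH_SPACE["hidden_dim"]),
        "learning_rate": rng.uniform(low, high),
    }

rng = random.Random(0)  # fixed seed so the draws are repeatable
for _ in range(3):
    print(sample_config(rng))
```

With a uniform draw, most sampled learning rates land in the upper part of the range. A log-uniform distribution might cover the lower values more evenly, which could be worth trying if small learning rates matter.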

## **18. Evaluation**

**Rationale:**

- This evaluation phase is run after training to assess the final performance of the model on a validation or test set.
- The best saved model (based on validation accuracy) is loaded from disk.
- Final accuracy is computed to summarize model quality.

```python
# Load the best model for final evaluation
def evaluate_model(model_class, input_dim, hidden_dim, output_dim, val_loader):
    model = model_class(input_dim, hidden_dim, output_dim)
    model.load_state_dict(torch.load("best_model.pt"))
    model.eval()

    total, correct = 0, 0
    with torch.no_grad():
        for questions, choices, labels in val_loader:
            outputs = model(questions, choices)
            preds = torch.argmax(outputs, dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)

    accuracy = correct / total
    print(f"Final Evaluation Accuracy: {accuracy:.4f}")
    wandb.log({"final_eval_accuracy": accuracy})

# Example call:
# evaluate_model(FeedforwardClassifier, 300, 128, 5, val_loader)
```
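
When reading the final number, a useful reference point is the accuracy of random guessing, which is 1/5 for five answer choices. The sketch below computes accuracy on a handful of invented predictions and compares it with that baseline. The helper names and the sample values are assumptions for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def chance_accuracy(num_choices):
    """Expected accuracy of uniform random guessing."""
    return 1 / num_choices

# Invented predictions and labels for four questions.
print(accuracy([0, 2, 1, 4], [0, 2, 3, 4]))  # 0.75
print(chance_accuracy(5))                    # 0.2
```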

## **19. Tools Used**

This project relies on the following tools and libraries:

- **PyTorch**: Model building, training, and data utilities
- **Hugging Face Datasets**: Loading CommonsenseQA efficiently
- **NLTK**: Tokenization and text cleaning
- **Gensim**: Pretrained word embeddings (word2vec)
- **Weights & Biases (wandb)**: Logging, hyperparameter tracking, and visualizations

---

📊 **Experiment tracking report:** [View report on WandB](https://wandb.ai/YOUR-USER/YOUR-PROJECT/reports)

