# Assignment 4: Recurrent Neural Networks (41 marks total)
### Due: November 19 at 11:59pm (grace period until November 21 at 11:59pm)

### Name: Hiu Sum Yuen

The goal of this assignment is to apply Recurrent Neural Networks (RNNs) in PyTorch for text data classification.

## Part 1: LSTM

### Step 0: Import Libraries

In [120]:
import torch
from datasets import load_dataset
from collections import Counter
import re
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn

In [121]:
import warnings
warnings.filterwarnings(action='ignore')

### Step 1: Data Loading and Preprocessing (12 marks)

For this assignment, we will be using the imdb dataset from the ðŸ¤— Datasets library

In [122]:
# TO DO: Load the dataset (1 mark)
dataset = load_dataset('imdb')

We need to preprocess the data before we can feed it into the model. The first step is to define a custom tokenizer to perform the following tasks: 
- Extract the text data from the dataset
- Remove any non-alphanumeric characters
- Separate each data sample into separate words (tokens)

In [123]:
def tokenizer(data_iter):
    '''Tokenizes the input data
    input: data_iter (type: dictionary)
    output: text (type: list[list[str]])
    '''
    # TO DO: fill in this function (2 marks)
    text_data = []
    for text in data_iter['text']:
        # Remove non-alphanumeric characters and convert to lowercase
        cleaned_text = re.sub(r'[^a-zA-Z0-9\\s]', '', text.lower())
        # Split into tokens (words)
        tokens = cleaned_text.split()
        text_data.append(tokens)
    return text_data

We will also need to extract the labels from the dataset. Complete the label_extractor function below:

In [124]:
def label_extractor(data_iter):
    '''Takes the label for each data sample and stores it in a separate list
    input: data_iter (type: dictionary)
    output: labels (type: list)
    '''
    # TO DO: fill in this function (1 mark)
    return [label for label in data_iter['label']]

Now that we have the text data separated into words, we need to define the vocabulary. We cannot keep all the words in the vocabulary, so we want to limit the vocabulary size and only take the most common words. In this case, the maximum vocabulary size is 10,000 words. Any word that is excluded will be set to an unknown token. You can use the function below to build the vocabulary:

In [125]:
# Build a vocabulary
def build_vocab(data_iter, max_size=20000):
    '''Creates a vocabulary based on the training data
    input: data_iter (type: list[list[str]])
    output: vocab (type: dictionary)
    '''
    counter = Counter()
    for words in data_iter:
        counter.update(words)
    # Filter to most common words
    vocab = {word: i + 1 for i, (word, _) in enumerate(counter.most_common(max_size))}
    # Add a token for unknown words (0)
    vocab['<unk>'] = 0 
    return vocab

In the vocabulary, each word is mapped to a number in the vocabulary. We will need to encode the dataset based on these numbers, as tensors cannot handle string data.

The next step is to pad or truncate each sequence based on a maximum length, to make sure that the dataset can be transformed into a tensor (as discussed in class).

Fill in the function below to encode and pad the dataset:

In [126]:
def encode_and_pad(text, vocab, max_len=300):
    '''Encode and pad the input text dataset
    input: text (type: list[list[str]])
    input: vocab (type: dictionary)
    input: max_len (type: int)
    output: texts (type: list[list[str]])
    '''
    # TO DO: fill in the function to encode text to integers and pad/truncate sequences (2 marks)
    encoded_texts = []
    for tokens in text:
        # Encode tokens to integers
        encoded = [vocab.get(token, vocab['<unk>']) for token in tokens]
        # Pad or truncate
        if len(encoded) < max_len:
            # Pad with zeros
            padded = encoded + [0] * (max_len - len(encoded))
        else:
            # Truncate
            padded = encoded[:max_len]
        encoded_texts.append(padded)
    return encoded_texts

The next step is to create a custom PyTorch Dataset class that calls the `encode_and_pad()` function and stores the text and labels as tensors. Fill in the `init` portion of the class: 

In [127]:
# Create a custom PyTorch Dataset class
class TextDataset(Dataset):
    def __init__(self, texts, labels, vocab, max_len):
        # TO DO: call the encode_and_pad() function and set self.texts and self.labels (2 marks)
        self.texts = torch.tensor(encode_and_pad(texts, vocab, max_len), dtype=torch.long)
        self.labels = torch.tensor(labels, dtype=torch.float)
    def __len__(self): 
        return len(self.labels)
    def __getitem__(self, idx): 
        return self.texts[idx], self.labels[idx]

Now you can call all the functions that have been created:

In [128]:
MAX_LEN = 256 # Sequence length
BATCH_SIZE = 64

# TO DO: Tokenize training data (1 mark)
train_text = tokenizer(dataset['train'])
test_text = tokenizer(dataset['test'])
# TO DO: Extract labels from training and testing data (1 mark)
train_labels = label_extractor(dataset['train'])
test_labels = label_extractor(dataset['test'])
# TO DO: Build Vocabulary (from training data only) (1 mark)
vocab = build_vocab(train_text)
# TO DO: Prepare datasets (using TextDataset class) and store datasets using DataLoaders (1 mark)
train_dataset = TextDataset(train_text, train_labels, vocab, MAX_LEN)
test_dataset = TextDataset(test_text, test_labels, vocab, MAX_LEN)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

### Step 2: Define Model (4 marks)

For this assignment, we will be using the LSTM model. Inside the LSTM model, the first layer will be an embedding layer, to convert the singular numerical representation of each word into an embedded vector. We can use `nn.Embedding(...)` for this.

Define LSTMClassifier below:

In [129]:
# TO DO: Define LSTM class (4 marks)
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, num_layers=1):
        super().__init__()
        # TO DO: Embedding layer 
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        # TO DO: LSTM layer
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        # TO DO: Linear fully-connected layer
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # TO DO: Fill in the model steps
        # NOTE: The LSTM outputs (output, (hidden, cell)) - hidden and cell are not used
        # NOTE: Use the hidden state from the final time step for the fc layer
        embedded = self.embedding(x)
        lstm_out, (hidden, cell) = self.lstm(embedded)
        # Use the hidden state from the final time step
        last_hidden = hidden[-1]
        # Fully connected layer
        output = self.fc(last_hidden)
        # Squeeze the output to remove the extra dimension for BCEWithLogitsLoss
        return output.squeeze(1)  # This changes [batch_size, 1] to [batch_size]

### Step 3: Define Training and Testing Loops (4 marks)

The next step is to define functions for the training and testing loops. For this case, we will only be calculating the loss at each epoch.

In [130]:
# TO DO: Define training loop (2 marks)
def train_model(model, train_loader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.float().to(device)
        optimizer.zero_grad()
        output = model(data)  # Now output shape is [batch_size]
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(train_loader)

In [131]:
# TO DO: Define testing loop (2 marks)
def test_model(model, test_loader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.float().to(device)
            
            # Get predictions (already squeezed in model forward)
            logits = model(data)
            
            # Calculate loss
            loss = criterion(logits, target)
            total_loss += loss.item()

            # Convert logits to probabilities and then to binary predictions
            probs = torch.sigmoid(logits)
            preds = (probs >= 0.5).long()

            # Compute accuracy
            correct += (preds == target.long()).sum().item()
            total += target.size(0)

    accuracy = correct / total
    return total_loss / len(test_loader), accuracy


### Step 4: Train and Evaluate (3 marks)

Now that we have all the necessary functions, we can select our hyperparameters, and train and evaluate our model. For this case, since we are not comparing different models, we do not need a validation set.

In [132]:
# Hyperparameters
VOCAB_SIZE = len(vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
OUTPUT_DIM = 1 # Binary classification
NUM_LAYERS = 1

In [133]:
# TO DO: Create model object (1 mark)
model = LSTMClassifier(VOCAB_SIZE, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, NUM_LAYERS)

In [134]:
import torch.optim as optim

# Use GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

LSTMClassifier(
  (embedding): Embedding(20001, 100, padding_idx=0)
  (lstm): LSTM(100, 128, batch_first=True)
  (fc): Linear(in_features=128, out_features=1, bias=True)
)

Since this case is binary optimization, we will use the binary cross entropy criterion, `BCEWithLogitsLoss()`. This model is similar to Cross Entropy, but uses a sigmoid layer instead of a softmax layer. For the optimization function, we will use Adam with a learning rate of 0.01.

In [135]:
# TO DO: Define optimization model and criterion (1 mark)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()

We can now run our training and testing loops. Since this takes a long time to run, we will set the number of epochs to 5. Print out the training and testing losses.

In [136]:
# TO DO: Run training and testing loops and print losses for each epoch (1 mark)
NUM_EPOCHS = 5

for epoch in range(NUM_EPOCHS):
    train_loss = train_model(model, train_loader, optimizer, criterion, device)
    test_loss, test_accuracy = test_model(model, test_loader, criterion, device)
    print(f'Epoch {epoch+1}/{NUM_EPOCHS}, Train Loss: {train_loss:.4f}, Test Loss: {test_loss:.4f}, Test Accuracy: {test_accuracy:.4f}')

Epoch 1/5, Train Loss: 0.6932, Test Loss: 0.6932, Test Accuracy: 0.5000
Epoch 2/5, Train Loss: 0.6932, Test Loss: 0.6932, Test Accuracy: 0.5000
Epoch 3/5, Train Loss: 0.6932, Test Loss: 0.6932, Test Accuracy: 0.5000
Epoch 4/5, Train Loss: 0.6932, Test Loss: 0.6931, Test Accuracy: 0.5000
Epoch 5/5, Train Loss: 0.6932, Test Loss: 0.6931, Test Accuracy: 0.5000


## Part 2: Questions and Process Description

### Questions (12 marks)

1. Do you think this model worked well to classify the data? Why or why not? Can you make a good decision about this only using loss data?
    1. With Loss at 0.69 without dropping, the model shows that it is not learning and effectively performing like random select.
2. What could you do to further improve the results? Provide two suggestions.
    1. Use bidirectional LSTM to capture context from both directions.
    2. Use pre-trained word embeddings (like GloVe or Word2Vec) instead of training embeddings from scratch.
3. Why does a simple RNN often underperform compared to LSTM or GRU on long text sequences such as IMDB reviews?
    1. Simple RNNs suffer from learning long-range dependencies in long sequences. LSTMs and GRUs have gating mechanisms (forget, input, output gates) that allow them to better preserve and control information flow over long sequences.
4. Why does the embedding layer improve performance compared to one-hot encoding?
    1. Embedding layers learn dense, continuous vector representations that capture semantic relationships between words, while one-hot encoding creates sparse, high-dimensional vectors with no meaningful relationships between different words. Embeddings also have much lower dimensionality and can generalize better.
5. If we switched to character-level input instead of word-level, what changes would we expect in performance and training time?
    1. Character-level models would have longer sequences, increasing training time significantly. Performance might decrease initially due to the increased complexity of learning from characters. Though we may find use in niche cases in better performance on out-of-vocabulary words.
6. How does vocabulary size influence model performance and generalization?
Larger vocabulary sizes can capture more nuanced language but require more parameters and training data. Smaller vocabularies are more computationally efficient but may lose important semantic information. There's a trade-off - too small and you lose information, too large and you may overfit or require excessive resources.

*ANSWER HERE*
1. 

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code? DeepSeek
1. In what order did you complete the steps? in order, then fix bugs
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not? why am i wrong, what does this do, what does this mean if i do this.
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful? a lot of troubles with having the model learn, instead of a non learning model. I had to trial and error with debugging.

*DESCRIBE YOUR PROCESS HERE - BE SPECIFIC*

## Part 3: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

i disliked that i did not understand much of what the problem was, even though i understand RNN conceptually in lecture. I found this assignment frustrating and confusing because i could write code without error but the model wasn't learning so i knew i was doing something wrong.