# Assignment 4: Recurrent Neural Networks (41 marks total)
### Due: November 19 at 11:59pm (grace period until November 21 at 11:59pm)

### Name:

The goal of this assignment is to apply Recurrent Neural Networks (RNNs) in PyTorch for text data classification.

## Part 1: LSTM

### Step 0: Import Libraries

In [None]:
import torch
from datasets import load_dataset
from collections import Counter
import re
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn

In [None]:
import warnings
warnings.filterwarnings(action='ignore')

### Step 1: Data Loading and Preprocessing (12 marks)

For this assignment, we will be using the IMDB dataset from the 🤗 Datasets library.

In [None]:
# TO DO: Load the dataset (1 mark)


We need to preprocess the data before we can feed it into the model. The first step is to define a custom tokenizer to perform the following tasks: 
- Extract the text data from the dataset
- Remove any non-alphanumeric characters
- Split each data sample into individual words (tokens)
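
As a rough illustration of the cleaning and splitting steps listed above, here is a minimal sketch that works on a single string rather than on the whole dataset. The helper name `tokenize_text` is hypothetical, and lowercasing is an assumption that the task list does not require.

```python
import re

def tokenize_text(text):
    """Hypothetical helper: lowercase, drop non-alphanumeric characters, split into words."""
    # Replace every character that is not a letter, digit, or whitespace with a space
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # Split on any run of whitespace
    return cleaned.split()

sample_tokens = tokenize_text("A great film! <br />Loved it, 10/10.")
print(sample_tokens)
# ['a', 'great', 'film', 'br', 'loved', 'it', '10', '10']
```

Note that the HTML line break `<br />` leaves a stray `br` token behind. Whether that matters for the classifier is a judgment call.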

In [None]:
def tokenizer(data_iter):
    '''Tokenizes the input data
    input: data_iter (type: dictionary)
    output: text (type: list[list[str]])
    '''
    # TO DO: fill in this function (2 marks)
    pass

We will also need to extract the labels from the dataset. Complete the label_extractor function below:

In [None]:
def label_extractor(data_iter):
    '''Takes the label for each data sample and stores it in a separate list
    input: data_iter (type: dictionary)
    output: labels (type: list)
    '''
    # TO DO: fill in this function (1 mark)
    pass

Now that we have the text data separated into words, we need to define the vocabulary. We cannot keep all the words in the vocabulary, so we want to limit the vocabulary size and only take the most common words. In this case, the maximum vocabulary size is 10,000 words. Any word that is excluded will be set to an unknown token. You can use the function below to build the vocabulary:

In [None]:
# Build a vocabulary
def build_vocab(data_iter, max_size=10000):
    '''Creates a vocabulary based on the training data
    input: data_iter (type: list[list[str]])
    output: vocab (type: dictionary)
    '''
    counter = Counter()
    for words in data_iter:
        counter.update(words)
    # Filter to most common words
    vocab = {word: i + 1 for i, (word, _) in enumerate(counter.most_common(max_size))}
    # Add a token for unknown words (0)
    vocab['<unk>'] = 0 
    return vocab
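
To see what `build_vocab` returns, the sketch below repeats the function so it runs on its own and applies it to a tiny made-up corpus. The toy data and the `vocab.get(...)` lookup are illustrative assumptions, not part of the assignment.

```python
from collections import Counter

def build_vocab(data_iter, max_size=10000):
    counter = Counter()
    for words in data_iter:
        counter.update(words)
    vocab = {word: i + 1 for i, (word, _) in enumerate(counter.most_common(max_size))}
    vocab['<unk>'] = 0
    return vocab

# Toy corpus: 'good' appears 3 times, 'movie' twice, 'film' and 'bad' once each
toy_corpus = [["good", "movie"], ["good", "film"], ["good", "movie", "bad"]]
toy_vocab = build_vocab(toy_corpus, max_size=2)
print(toy_vocab)       # {'good': 1, 'movie': 2, '<unk>': 0}
print(len(toy_vocab))  # 3, i.e. max_size + 1 because of '<unk>'

# Words outside the vocabulary fall back to the unknown index
bad_id = toy_vocab.get("bad", toy_vocab['<unk>'])
print(bad_id)          # 0
```

The vocabulary holds `max_size + 1` entries once `<unk>` is added, which matters when its length is later used as the embedding table size.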

The vocabulary maps each word to an integer index. We will need to encode the dataset using these indices, as tensors cannot hold string data.

The next step is to pad or truncate each sequence based on a maximum length, to make sure that the dataset can be transformed into a tensor (as discussed in class).
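
The padding and truncation idea can be sketched on a single list of token indices. The helper name `pad_or_truncate`, padding on the right, and using 0 as the padding value are all assumptions here. With the vocabulary above, 0 is also the `<unk>` index, so follow the approach from class if it differs.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Hypothetical helper: force a list of token ids to exactly max_len entries."""
    trimmed = ids[:max_len]  # cut off anything beyond max_len
    # Append pad_id until the sequence reaches max_len (no-op if already full)
    return trimmed + [pad_id] * (max_len - len(trimmed))

short_seq = pad_or_truncate([5, 2, 9], max_len=5)
long_seq = pad_or_truncate([5, 2, 9, 4, 7, 1], max_len=5)
print(short_seq)  # [5, 2, 9, 0, 0]
print(long_seq)   # [5, 2, 9, 4, 7]
```

Because every sequence ends up with the same length, a list of such sequences can be passed to `torch.tensor(...)` directly.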

Fill in the function below to encode and pad the dataset:

In [None]:
def encode_and_pad(text, vocab, max_len=100):
    '''Encode and pad the input text dataset
    input: text (type: list[list[str]])
    input: vocab (type: dictionary)
    input: max_len (type: int)
    output: texts (type: list[list[int]])
    '''
    # TO DO: fill in the function to encode text to integers and pad/truncate sequences (2 marks)
    pass

The next step is to create a custom PyTorch Dataset class that calls the `encode_and_pad()` function and stores the text and labels as tensors. Fill in the `__init__` method of the class:

In [None]:
# Create a custom PyTorch Dataset class
class TextDataset(Dataset):
    def __init__(self, texts, labels, vocab, max_len):
        # TO DO: call the encode_and_pad() function and set self.texts and self.labels (2 marks)
        pass
    def __len__(self): 
        return len(self.labels)
    def __getitem__(self, idx): 
        return self.texts[idx], self.labels[idx]

Now you can call all the functions that have been created:

In [None]:
MAX_LEN = 256 # Sequence length
BATCH_SIZE = 64

# TO DO: Tokenize training and testing data (1 mark)

# TO DO: Extract labels from training and testing data (1 mark)

# TO DO: Build Vocabulary (from training data only) (1 mark)

# TO DO: Prepare datasets (using TextDataset class) and store datasets using DataLoaders (1 mark)


### Step 2: Define Model (4 marks)

For this assignment, we will be using the LSTM model. Inside the LSTM model, the first layer will be an embedding layer, which converts the single integer index of each word into a dense embedding vector. We can use `nn.Embedding(...)` for this.
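
As a small illustration of what the embedding layer does to tensor shapes, the sketch below looks up a batch of integer indices. The sizes (10 words, 4 dimensions) are arbitrary toy values, not the assignment's hyperparameters.

```python
import torch
import torch.nn as nn

# Toy embedding table: 10 possible indices, each mapped to a 4-dimensional vector
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# Batch of 2 sequences, each 3 tokens long (integer dtype is required)
token_ids = torch.tensor([[1, 2, 0], [3, 3, 9]])
vectors = embedding(token_ids)

print(vectors.shape)  # torch.Size([2, 3, 4])
# The same index always yields the same vector
print(torch.equal(vectors[1, 0], vectors[1, 1]))  # True
```

The layer adds a trailing dimension of size `embedding_dim`, so an input of shape (batch, sequence length) becomes (batch, sequence length, embedding dimension).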

Define LSTMClassifier below:

In [None]:
# TO DO: Define LSTM class (4 marks)
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, num_layers=1):
        super().__init__()
        # TO DO: Embedding layer 
        pass
        # TO DO: LSTM layer
         
        # TO DO: Linear fully-connected layer
        

    def forward(self, x):
        # TO DO: Fill in the model steps
        # NOTE: nn.LSTM returns (output, (hidden, cell)); the separate hidden and cell tensors are not needed here
        # NOTE: Pass the final time step of output (the last hidden state) to the fc layer
        pass
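
The shapes returned by `nn.LSTM` can be checked on random data. This sketch assumes `batch_first=True` and a single-direction LSTM. With those settings, the last time step of `output` matches the final hidden state of the top layer. The sizes are toy values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=6, num_layers=1, batch_first=True)

x = torch.randn(2, 3, 4)  # (batch=2, seq_len=3, features=4)
output, (hidden, cell) = lstm(x)

print(output.shape)  # torch.Size([2, 3, 6]): one hidden vector per time step
print(hidden.shape)  # torch.Size([1, 2, 6]): final hidden state per layer

# Last time step of the per-step outputs, shape (batch, hidden_size)
last_step = output[:, -1, :]
print(torch.allclose(last_step, hidden[-1]))  # True
```

Without `batch_first=True`, the first two dimensions of `x` and `output` are swapped, and the slicing has to change accordingly.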

### Step 3: Define Training and Testing Loops (4 marks)

The next step is to define functions for the training and testing loops. For this case, we will only be calculating the loss at each epoch.
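
One possible shape for such a loop is sketched below on a toy linear model with random data. The helper `run_epoch` is hypothetical and merges training and evaluation into one function for brevity, whereas the assignment asks for two separate loops. Moving batches to a device is left out.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# Toy data: 32 samples with 8 features and random binary labels (floats for BCE)
toy_x = torch.randn(32, 8)
toy_y = torch.randint(0, 2, (32,)).float()
toy_loader = DataLoader(TensorDataset(toy_x, toy_y), batch_size=8)

toy_model = nn.Linear(8, 1)
toy_criterion = nn.BCEWithLogitsLoss()
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.01)

def run_epoch(model, loader, criterion, optimizer=None):
    """Hypothetical helper: mean per-batch loss; updates weights only if an optimizer is given."""
    training = optimizer is not None
    model.train(training)  # train mode when training, eval mode otherwise
    total = 0.0
    with torch.set_grad_enabled(training):
        for inputs, targets in loader:
            logits = model(inputs).squeeze(1)  # (batch, 1) -> (batch,)
            loss = criterion(logits, targets)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total += loss.item()
    return total / len(loader)

train_loss = run_epoch(toy_model, toy_loader, toy_criterion, toy_optimizer)
eval_loss = run_epoch(toy_model, toy_loader, toy_criterion)
print(train_loss, eval_loss)
```

Averaging over the number of batches gives one loss value per epoch, which is the only metric this step asks for.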

In [None]:
# TO DO: Define training loop (2 marks)


In [None]:
# TO DO: Define testing loop (2 marks)


### Step 4: Train and Evaluate (3 marks)

Now that we have all the necessary functions, we can select our hyperparameters, and train and evaluate our model. For this case, since we are not comparing different models, we do not need a validation set.

In [None]:
# Hyperparameters
VOCAB_SIZE = len(vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
OUTPUT_DIM = 1 # Binary classification
NUM_LAYERS = 1

In [None]:
# TO DO: Create model object (1 mark)


In [None]:
import torch.optim as optim

# Use GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

Since this is a binary classification problem, we will use the binary cross entropy criterion, `BCEWithLogitsLoss()`. This criterion is similar to cross entropy, but applies a sigmoid instead of a softmax to the model output. For the optimizer, we will use Adam with a learning rate of 0.01.
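
The relationship between the sigmoid and `BCEWithLogitsLoss()` can be checked numerically. The sketch below uses made-up logits and targets and compares the criterion against an explicit sigmoid followed by `nn.BCELoss()`.

```python
import torch
import torch.nn as nn

# Raw model outputs (logits) and float targets, as BCE criteria expect
logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 1.0])

# Sigmoid applied inside the criterion
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)
# Sigmoid applied by hand, then plain binary cross entropy
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

print(torch.allclose(loss_from_logits, loss_from_probs, atol=1e-6))  # True
```

Since the sigmoid is built into the criterion, the model should output raw logits, with no sigmoid in its `forward` method.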

In [None]:
# TO DO: Define optimizer and criterion (1 mark)


We can now run our training and testing loops. Since this takes a long time to run, we will set the number of epochs to 5. Print out the training and testing losses.

In [None]:
# TO DO: Run training and testing loops and print losses for each epoch (1 mark)


## Part 2: Questions and Process Description

### Questions (12 marks)

1. Do you think this model worked well to classify the data? Why or why not? Can you make a good decision about this only using loss data?
1. What could you do to further improve the results? Provide two suggestions.
1. Why does a simple RNN often underperform compared to LSTM or GRU on long text sequences such as IMDB reviews?
1. Why does the embedding layer improve performance compared to one-hot encoding?
1. If we switched to character-level input instead of word-level, what changes would we expect in performance and training time?
1. How does vocabulary size influence model performance and generalization?

*ANSWER HERE*

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE - BE SPECIFIC*

## Part 3: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- what you found interesting, confusing, challenging, or motivating

while working on this assignment.


*ADD YOUR THOUGHTS HERE*