# Assignment 3: Build a Feedforward Neural Net for Sentiment Classification (100 Points)

Instructor: Ziyu Yao; Class: CS478 Fall 2024

Released on Sept 18, 2024

**Assignment Overview:** This assignment will guide you to use the PyTorch library (https://pytorch.org/) to construct a feedforward neural net (FFNN) for sentiment classification. We will use the movie review dataset of Socher et al. (2013). The original dataset has fine-grained sentiment labels (e.g., highly negative vs. negative), but we will consider a simplified task version with only two class labels, i.e., positive (1) and negative (0). Therefore, this will be a binary classification task.

This assignment consists of six parts and should be submitted in three checkpoints (see the PDF for instructions):

_Checkpoint 1, from Part 1 to Part 2 (Due on Sept 30)_
- **Part 1 (2 Points):** Get to know PyTorch;
- **Part 2 (3 Points):** Data structure and loading;

_Checkpoint 2, from Part 3 to Part 4 (Due on Oct 9)_
- **Part 3 (5 Points):** FFNN model construction;
- **Part 4 (3 Points):** Sentiment classifier training and evaluation;

_Checkpoint 3, from Part 5 to Part 6 (Due on Oct 16)_
- **Part 5 (4 Points):** Exploration of FFNN hyper-parameters;
- **Part 6 (3 Points):** Exploration of the pre-trained GloVe word embedding.

**Tutorial for Beginners:** If you are new to PyTorch, we highly recommend going through the tutorial: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html.

## Part 0: PyTorch Installation
Install the PyTorch library:

In [1]:
pip install torch

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## Part 1: Get to Know PyTorch (10 Points)

To help you kick-start this assignment, here are some small exercises on PyTorch basics.

Let's first import the PyTorch library:

In [2]:
import torch
from torch import nn, optim



Consider two 2-dimensional vectors (i.e., 1-D tensors with two elements each), $\mathrm{a}$ and $\mathrm{b}$, which are parameters of a neural network (NN). The NN is defined as $\mathrm{f}(\mathrm{a}, \mathrm{b}) = 3\mathrm{a}^3 - \mathrm{b}^2$.

All operations here are element-wise. Written out per element, the function can be expressed equivalently in the following form:
$$
\begin{pmatrix} f_0 \\ f_1 \end{pmatrix} = 3 \begin{pmatrix} a_0^3 \\ a_1^3 \end{pmatrix} - \begin{pmatrix} b_0^2 \\ b_1^2 \end{pmatrix}
$$
where
$$
\mathrm{a} = \begin{pmatrix} a_0 \\ a_1 \end{pmatrix}, \: \mathrm{b} = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix}, \: \mathrm{f} = \begin{pmatrix} f_0 \\ f_1 \end{pmatrix}
$$

We also define a loss function $l = f_0 + 2 f_1$. Note that $l$ is a scalar.

Q1: Now given the following two tensors $\mathrm{a}$ and $\mathrm{b}$ (`requires_grad=True` for tracking their gradients),

In [3]:
a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

can you implement the forward pass of the neural network function and its loss, and show their values?

<font color='blue'>YOUR TASK: Please complete the following code block and answer the question.</font>

In [7]:
# TODO
f = (3 * torch.pow(a, 3)) - torch.pow(b, 2)
l = f[0] + (2 * f[1])

print("f:", f)
print("l:", l)

f: tensor([-12.,  65.], grad_fn=<SubBackward0>)
l: tensor(118., grad_fn=<AddBackward0>)


Q2: Given the values of `a` and `b` and the loss `l`, what are the gradients of `l` with respect to `a` and `b`?

<font color='blue'>YOUR TASK: Please write down the formula of each partial derivative and complete the calculation below. The example of $\frac{\partial l}{\partial a_0}$ is given.</font>

$\frac{\partial l}{\partial a_0} = 9 a_0^2 = 36$, $\frac{\partial l}{\partial a_1} = \text{REPLACE ME}$, $\frac{\partial l}{\partial b_0} = \text{REPLACE ME}$, $\frac{\partial l}{\partial b_1} = \text{REPLACE ME}$
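
One way to sanity-check a hand-derived partial derivative is a central finite difference. The sketch below is a plain-Python illustration only: the helper names `loss` and `central_difference` are hypothetical and not part of the assignment. It checks just the given example $\frac{\partial l}{\partial a_0}$; the same helper could be pointed at the other three arguments.

```python
def loss(a0, a1, b0, b1):
    """l = f_0 + 2 * f_1, with f_i = 3 * a_i**3 - b_i**2 (element-wise)."""
    f0 = 3 * a0 ** 3 - b0 ** 2
    f1 = 3 * a1 ** 3 - b1 ** 2
    return f0 + 2 * f1


def central_difference(fn, args, i, eps=1e-4):
    """Approximate the partial derivative of fn with respect to args[i]."""
    plus = list(args)
    plus[i] += eps
    minus = list(args)
    minus[i] -= eps
    return (fn(*plus) - fn(*minus)) / (2 * eps)


point = [2.0, 3.0, 6.0, 4.0]  # a0, a1, b0, b1 from Q1
dl_da0 = central_difference(loss, point, 0)
print(dl_da0)  # should be close to the analytic value 9 * a0**2 = 36
```

If the numerical estimate and your formula disagree by more than a small rounding error, the formula is the more likely culprit.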


Q3: Now, try to use PyTorch's `autograd` to calculate the gradients automatically.

<font color='blue'>YOUR TASK: Please complete the following code block and answer the question.</font>

In [9]:
# TODO: apply the autograd function to the loss `l` to calculate the gradients of `a` and `b`

l.backward()

print("gradient of `a`:", a.grad)
print("gradient of `b`:", b.grad)

gradient of `a`: tensor([ 36., 162.])
gradient of `b`: tensor([-12., -16.])


_If your answers to Q2 and Q3 are both correct, you should get exactly the same results from both._
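
A common source of mismatches in a notebook is gradient accumulation: PyTorch adds each new gradient to whatever is already stored in `.grad`, so repeating a backward pass without re-creating the tensors inflates the values. The sketch below illustrates this on a separate toy tensor `x` (a hypothetical example, not one of the assignment's tensors).

```python
import torch

x = torch.tensor([2.0], requires_grad=True)

# First backward pass: d(x^2)/dx = 2x = 4.
(x ** 2).sum().backward()
first = x.grad.clone()

# Second backward pass without clearing: the new gradient is added to the old one.
(x ** 2).sum().backward()
accumulated = x.grad.clone()

# Reset the stored gradient in place, then run backward again.
x.grad.zero_()
(x ** 2).sum().backward()
after_reset = x.grad.clone()

print(first, accumulated, after_reset)
```

The same behavior is why a training loop clears the gradients before each batch.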

**By now, you've gained some familiarity with PyTorch. From Part 2 on, we will start building the Feedforward Neural Net (FFNN) for sentiment classification. Before that, let's load the following libraries.**

In [10]:
from typing import List, Dict
import random
import numpy as np
from collections import Counter
import os

# Set up overall seed
seed = 12345
random.seed(seed)
torch.manual_seed(seed)
np.random.seed(seed)

## Part 2: Data Structure and Loading (15 Points)

**Step 1:** We will first define the data structure for storing the sentiment data. _There's nothing to fill out from your side; however, you should try to understand the data implementation._

In [11]:
class SentimentExample:
    """
    Data wrapper for a single example for sentiment analysis.

    Attributes:
        words (List[string]): list of words
        label (int): 0 or 1 (0 = negative, 1 = positive)
        word_indices (List[int]): list of word indices in the vocab, which will be generated by the `indexing_sentiment_examples` method
    """

    def __init__(self, words, label):
        self.words = words
        self.label = label
        self.word_indices = None # the word indices in vocab

    def __repr__(self):
        return repr(self.words) + "; label=" + repr(self.label)

    def __str__(self):
        return self.__repr__()

Essentially, this `SentimentExample` class defines each data example in a sentiment dataset to have three attributes: the word tokens in the sentence, the sentiment label (1 or 0), and the word indices in the vocabulary (which has not been defined yet).

To assist with data loading and writing, we also define the following helper functions:

In [12]:
def indexing_sentiment_examples(exs: List[SentimentExample], vocabulary: List[str], UNK_idx: int):
    """
    Indexing words in each SentimentExample based on a given vocabulary. This method will directly modify the `word_indices` attribute of each ex.
    :param exs: a list of SentimentExample objects
    :param vocabulary: the vocabulary, which should be a list of words
    :param UNK_idx: the index of UNK token in the vocabulary
    """
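    # Build a word -> index lookup once; setdefault keeps the first index, matching list.index
    word_to_idx = {}
    for idx, word in enumerate(vocabulary):
        word_to_idx.setdefault(word, idx)
    for ex in exs:
        ex.word_indices = [word_to_idx.get(word, UNK_idx) for word in ex.words]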

def read_sentiment_examples(infile: str) -> List[SentimentExample]:
    """
    Reads sentiment examples in the format [0 or 1]<TAB>[raw sentence]; tokenizes and cleans the sentences and forms
    SentimentExamples. Note that all words have been lowercased.

    :param infile: file to read from
    :return: a list of SentimentExamples parsed from the file
    """
    f = open(infile)
    exs = []
    for line in f:
        if len(line.strip()) > 0:
            line = line.strip()
            fields = line.split("\t")
            if len(fields) != 2:
                fields = line.split()
                label = 0 if "0" in fields[0] else 1
                sent = " ".join(fields[1:])
            else:
                # Slightly more robust to reading bad output than int(fields[0])
                label = 0 if "0" in fields[0] else 1
                sent = fields[1]
            sent = sent.lower() # lowercasing
            tokenized_cleaned_sent = list(filter(lambda x: x != '', sent.rstrip().split(" ")))
            exs.append(SentimentExample(tokenized_cleaned_sent, label))
    f.close()
    return exs


def read_blind_sst_examples(infile: str) -> List[SentimentExample]:
    """
    Reads the blind SST test set, which just consists of unlabeled sentences. Note that all words have been lowercased.
    :param infile: path to the file to read
    :return: a list of SentimentExamples (one per sentence), each with the pseudo label -1
    """
    exs = []
    with open(infile, encoding='utf-8') as f: # the file is closed automatically
        for line in f:
            if len(line.strip()) > 0:
                line = line.strip()
                words = line.lower().split(" ")
                exs.append(SentimentExample(words, label=-1)) # pseudo label -1
    return exs


def write_sentiment_examples(exs: List[SentimentExample], outfile: str):
    """
    Writes sentiment examples to an output file with one example per line, the predicted label followed by the example.
    Note that what gets written out is tokenized.
    :param exs: the list of SentimentExamples to write
    :param outfile: out path
    :return: None
    """
    o = open(outfile, 'w')
    for ex in exs:
        o.write(repr(ex.label) + "\t" + " ".join([word for word in ex.words]) + "\n")
    o.close()
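
To make the expected input format concrete, here is a minimal sketch of parsing a single `[0 or 1]<TAB>[raw sentence]` line. The helper `parse_labeled_line` and the sample sentence are hypothetical; unlike `read_sentiment_examples`, the sketch assumes a well-formed line and simply uses `int()` for the label.

```python
def parse_labeled_line(line: str):
    """Split one `[0 or 1]<TAB>[raw sentence]` line into (label, tokens)."""
    label_field, sentence = line.strip().split("\t")
    label = int(label_field)
    # Lowercase and split on single spaces, dropping empty strings,
    # which mirrors the cleaning done in read_sentiment_examples.
    tokens = [tok for tok in sentence.lower().split(" ") if tok != ""]
    return label, tokens


label, tokens = parse_labeled_line("1\tA charming  film .\n")
print(label, tokens)  # 1 ['a', 'charming', 'film', '.']
```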

Now, load the training, dev, and test sets:

In [13]:
DATA_PATH = "data"
# TODO: If you are using Colab, uncomment the following few lines of code
#   to mount your Google Drive; refer to Assignment 2 for a guide.
# from google.colab import drive
# drive.mount('/content/drive')
# DATA_PATH = "/content/drive/My Drive/CS478/data"


# Specify the data paths
train_path = os.path.join(DATA_PATH, "train.txt")
dev_path = os.path.join(DATA_PATH, "dev.txt")
blind_test_path = os.path.join(DATA_PATH, "test-blind.txt") # blind test

# Load train, dev, and test exs and index the words.
train_exs = read_sentiment_examples(train_path)
dev_exs = read_sentiment_examples(dev_path)
test_exs_words_only = read_blind_sst_examples(blind_test_path)
print(repr(len(train_exs)) + " / " + repr(len(dev_exs)) + " / " + repr(len(test_exs_words_only)) + " train/dev/test examples")

6920 / 872 / 1821 train/dev/test examples


Let's see what's inside `train_exs`:

First, the number of examples (i.e., annotated sentences with sentiment labels) contained in `train_exs`:

In [14]:
len(train_exs)

6920

Each data example can be accessed by index; e.g., to access the first data example in `train_exs`, you can run `train_exs[0]`. Let's save it as `train_example_0`.

In [15]:
train_example_0 = train_exs[0]

Each data example has a type:

In [16]:
type(train_example_0)

__main__.SentimentExample

In our implementation, each data example is an instance of our defined `SentimentExample` class.

You can access its word tokens (**NOTE: when loading the dataset, we have already performed the sentence tokenization. There's NO NEED to do further preprocessing.**):

In [17]:
print(train_example_0.words)

['the', 'rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'century', "'s", 'new', '``', 'conan', "''", 'and', 'that', 'he', "'s", 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'arnold', 'schwarzenegger', ',', 'jean-claud', 'van', 'damme', 'or', 'steven', 'segal', '.']


Similarly, you can access its sentiment label (1 for positive and 0 for negative):

In [18]:
print(train_example_0.label)

1


Note that at this moment, we haven't created the vocabulary. Therefore, the `word_indices` of each example is still `None`.

In [19]:
print(train_example_0.word_indices)

None


Now, let's create a vocabulary based on the training set `train_exs`. As in Assignment 2, we will keep only words with **more than 2** occurrences in our vocabulary, while converting others into a special `UNK` token.

When working with neural nets, we will also add a `PAD` token for padding a batch of data examples.

<font color='blue'>YOUR TASK: Please complete the following code block for preparing the vocabulary.</font>

In [20]:
# TODO: complete the code for creating a list called `vocab`, which contains a list of distinct words as the vocabulary.
# As instructed, we only consider words occurring more than twice in the vocabulary.

def add_word_count_to_count_dict(word, count_dict, count=None):
    if count_dict is None or word is None:
        return
    if count is not None:
        count_dict[word] = count
    elif word not in count_dict:
        count_dict[word] = 1
    else:
        count_dict[word] += 1

def count_word_types(exs):
    if exs is None:
        return None
    
    #build all counts
    counts = {}
    for example in exs:
        add_word_count_to_count_dict("<s>", counts)
        for word in example.words:
            add_word_count_to_count_dict(word, counts)
        add_word_count_to_count_dict("</s>", counts)
    
    #replace single counts with UNK
    finalCounts = {}
    for word in counts:
        if counts[word] == 1:
            add_word_count_to_count_dict("UNK", finalCounts)
        else:
            finalCounts[word] = counts[word]

    #remove UNK, it will be added later
    if "UNK" in finalCounts:
        finalCounts.pop("UNK")
    return finalCounts

def get_vocab(exs):
    if exs is None:
        return None
    return list(count_word_types(exs).keys())


vocab = get_vocab(train_exs)
assert isinstance(vocab, list)

# Now add the special tokens PAD and UNK
vocab = ["PAD", "UNK"] + vocab
PAD_IDX = 0
UNK_IDX = 1
# Show the vocabulary size:
print("Number of words in the vocabulary:", len(vocab))

Number of words in the vocabulary: 7145
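
For reference, the frequency rule stated in the instructions (keep words with more than 2 occurrences, map everything else to `UNK`) can be sketched compactly with `collections.Counter`. This is an illustration on a hypothetical toy corpus, not the code that produced the numbers above; the names `toy_sentences`, `min_count`, and `toy_vocab` are made up for the example.

```python
from collections import Counter

# Hypothetical toy corpus (already tokenized and lowercased).
toy_sentences = [
    ["a", "good", "movie"],
    ["a", "good", "good", "plot"],
    ["a", "bad", "movie"],
]

# Count every token occurrence across all sentences.
counts = Counter(word for sent in toy_sentences for word in sent)

# Keep words seen more than `min_count` times; 2 matches "more than 2 occurrences".
min_count = 2
kept = [word for word, c in counts.items() if c > min_count]

# Special tokens come first so that PAD is index 0 and UNK is index 1.
toy_vocab = ["PAD", "UNK"] + kept
word_to_idx = {word: idx for idx, word in enumerate(toy_vocab)}

# Words outside the vocabulary fall back to the UNK index (1).
indexed = [word_to_idx.get(word, 1) for word in ["a", "bad", "good"]]
print(toy_vocab)  # ['PAD', 'UNK', 'a', 'good']
print(indexed)    # [2, 1, 3]
```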


We then index the training, dev, and test set using this vocabulary.  

In [21]:
indexing_sentiment_examples(train_exs, vocabulary=vocab, UNK_idx=UNK_IDX)
indexing_sentiment_examples(dev_exs, vocabulary=vocab, UNK_idx=UNK_IDX)
indexing_sentiment_examples(test_exs_words_only, vocabulary=vocab, UNK_idx=UNK_IDX)

If you check the `word_indices` of each data example now, you should see non-empty contents:

In [22]:
print("The first example in the training set is indexed as:", "\n", train_example_0.word_indices)

The first example in the training set is indexed as: 
 [3, 4, 5, 6, 7, 8, 3, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 11, 19, 7, 20, 21, 22, 23, 24, 25, 26, 27, 28, 1, 29, 1, 30, 31, 1, 32]


**Step 2:** As we said in class, GPUs accelerate computation via matrix operations. One important concept here is the "data batch": instead of looking at one data example (i.e., one sentence) at a time, we look at a "batch" of examples. As such, all the calculations, such as loss computation, can be done via matrix operations.

To this end, we will define a class called `SentimentExampleBatchIterator` for loading a batch of data from the given dataset.

<font color='blue'>YOUR TASK: Please complete the following code block and implement the correct padding for loading a batch of examples.</font>

For example, given
```
batch_exs = [
    SentimentExample(words=["I", "feel", "happy"], label=1, word_indices=[2,3,4]),
    SentimentExample(words=["I", "feel", "sad"], label=0, word_indices=[2,3,5]),
    SentimentExample(words=["This", "movie", "is", "interesting"], label=1, word_indices=[6,7,8,9])
]
```
your code should generate (reminder: `0` in `batch_inputs` is the PAD index)
```
batch_inputs =
    [[2, 3, 4, 0],
    [2, 3, 5, 0],
    [6, 7, 8, 9]],
batch_lengths = [3, 3, 4],
batch_labels = [1, 0, 1].
```

Tip: since the indexed SentimentExample object already has the indices saved in `word_indices`,
what you need to do is to get them into one matrix (batch_inputs) and add PAD when necessary.
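
The padding step on its own can be sketched in a few lines of plain Python. The helper name `pad_batch` and the `toy_` variables are hypothetical; the sketch works on bare lists of indices rather than `SentimentExample` objects and reproduces the example above.

```python
def pad_batch(index_lists, pad_idx=0):
    """Right-pad each list of word indices to the longest length in the batch."""
    lengths = [len(indices) for indices in index_lists]
    max_len = max(lengths)
    padded = [indices + [pad_idx] * (max_len - len(indices)) for indices in index_lists]
    return padded, lengths


toy_inputs, toy_lengths = pad_batch([[2, 3, 4], [2, 3, 5], [6, 7, 8, 9]])
print(toy_inputs)   # [[2, 3, 4, 0], [2, 3, 5, 0], [6, 7, 8, 9]]
print(toy_lengths)  # [3, 3, 4]
```

Because every row ends up with the same length, the padded result can be converted directly into a 2D tensor.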

In [24]:
class SentimentExampleBatchIterator:
    """
    A batch iterator which will produce the next batch of indexed data.

    Attributes:
        data: a list of SentimentExample objects, which is the source data input
        batch_size: an integer number indicating the number of examples in each batch
        PAD_idx: the index of PAD in the vocabulary
        shuffle: whether to shuffle the data (should be set to True only for training)
    """
    def __init__(self, data: List[SentimentExample], batch_size: int, PAD_idx: int, shuffle: bool=True):
        self.data = data
        self.batch_size = batch_size
        self.PAD_idx = PAD_idx
        self.shuffle = shuffle

        self._indices = None
        self._cur_idx = None

    def refresh(self):
        self._indices = list(range(len(self.data)))
        if self.shuffle:
            random.shuffle(self._indices)
        self._cur_idx = 0

    def get_next_batch(self):
        if self._cur_idx < len(self.data): # loop over the dataset
            st_idx = self._cur_idx
            if self._cur_idx + self.batch_size > len(self.data) - 1:
                ed_idx = len(self.data)
            else:
                ed_idx = self._cur_idx + self.batch_size
            self._cur_idx = ed_idx # update
            # retrieve a batch of SentimentExample data
            batch_exs = [self.data[self._indices[_idx]] for _idx in range(st_idx, ed_idx)]

            # jagged_array is a list of lists of unequal length, e.g., [[0,0],[0],[0,0,0]]
            # returns [[0,0,pad_token],[0,pad_token,pad_token],[0,0,0]]
            def pad_jagged_array(jagged_array, pad_token):
                max_length = max([len(e) for e in jagged_array])
                def pad_inner_array(inner_array):
                    padded_inner_array = [pad_token] * max_length
                    for idx, x in enumerate(inner_array):
                        padded_inner_array[idx] = x
                    return padded_inner_array
                return [pad_inner_array(arr) for arr in jagged_array]

            # TODO: implement the batching process, which returns batch_inputs, batch_lengths, and batch_labels
            batch_lengths = [len(ex.words) for ex in batch_exs]
            batch_inputs = pad_jagged_array([ex.word_indices for ex in batch_exs], self.PAD_idx)
            batch_labels = [ex.label for ex in batch_exs]
            return (torch.tensor(batch_inputs), torch.tensor(batch_lengths), torch.tensor(batch_labels))
        else:
            return None

To test this batch iterator, run the following code and see if you can successfully load in the first four sentences in the training set into two batches, each containing 2 examples.

(Note that this is not the actual batch_size we will use in experiment; we only do this for a sanity check.)

In [26]:
toy_batch_iterator = SentimentExampleBatchIterator(train_exs[:4], batch_size=2, PAD_idx=0, shuffle=False) # hard-coded batch size and PAD_idx
toy_batch_iterator.refresh()

batch_count = 0
batch_data = toy_batch_iterator.get_next_batch()
while batch_data is not None:
    print("Batch %d:" % batch_count)
    batch_inputs, batch_lengths, batch_labels = batch_data
    # project to device
    print(batch_inputs)
    print(batch_lengths)
    print(batch_labels)
    print("-" * 10)

    batch_count += 1
    batch_data = toy_batch_iterator.get_next_batch()


Batch 0:
tensor([[  3,   4,   5,   6,   7,   8,   3,   9,  10,  11,  12,  13,  14,  15,
          16,  17,  18,  11,  19,   7,  20,  21,  22,  23,  24,  25,  26,  27,
          28,   1,  29,   1,  30,  31,   1,  32,   0,   0,   0],
        [  3,  34,  35,  36,  37,  13,   3,  38,  37,   3,  39,  15,  40,   5,
          41,  42,  17,  21,  43,  37,  44,  45,  46,  47,  48,  49,  50,  51,
          11,   1,  52,  37,   1,  53,  11,   1,  32,   0,   0],
        [  1,  54,  55,   1,  21,   1,  37,  56,  57,  21,  58,  59,  60,  28,
          21,  58,  61,  62,   1,   7,   3,  63,  57,  64,   3,  65,  66,  67,
          68,   3,  69,  28,  70,  28,  71,  37,   3,  72,  32],
        [ 73,   3,  74,   5,  75,  76,  77,  32,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
        [ 78,  30,  46,  79,  80,   1,  81,  82,  37,  83,  11,  84,  85,  13,
           3,  86,  

**Up to now, you've completed loading the datasets! This is your Checkpoint 1.**

## Part 3: FFNN Model Construction (25 Points)

In this part, we will define the FFNN-based sentiment classifier called `FeedForwardNeuralNetClassifier`.

<font color='blue'>YOUR TASK: Please read the in-line comments and fill out the `TODO`s.</font>

In [None]:
class FeedForwardNeuralNetClassifier(nn.Module):
    """
    The Feed-Forward Neural Net sentiment classifier.
    """
    def __init__(self, vocab_size, emb_dim, n_hidden_units):
        """
        In the __init__ function, you will define modules in FFNN.
        :param vocab_size: size of vocabulary
        :param emb_dim: dimension of the embedding vectors
        :param n_hidden_units: dimension of the hidden units
        """
        super(FeedForwardNeuralNetClassifier, self).__init__()
        self.vocab_size = vocab_size
        self.emb_dim = emb_dim
        self.n_hidden_units = n_hidden_units

        # TODO: implement a randomly initialized word embedding matrix using nn.Embedding
        # It should have a size of `(vocab_size x emb_dim)`
        self.word_embeddings = None

        # TODO: implement the FFNN architecture using nn functions
        self.classifier = None

        # TODO: the loss function
        self.loss = None

    def forward(self, batch_inputs: torch.Tensor, batch_lengths: torch.Tensor) -> torch.Tensor:
        """
        The forward function, which defines how FFNN should work when given a batch of inputs and their actual sent lengths (i.e., before PAD)
        :param batch_inputs: a torch.Tensor object of size (n_examples, max_sent_length_in_this_batch), which is the *indexed* inputs
        :param batch_lengths: a torch.Tensor object of size (n_examples), which describes the actual sentence length of each example (i.e., before PAD)
        :return: the logits of FFNN (i.e., the unnormalized hidden units before sigmoid) of shape (n_examples)
        """
        # TODO: implement
        logits = None
        return logits

    def batch_predict(self, batch_inputs: torch.Tensor, batch_lengths: torch.Tensor) -> List[int]:
        """
        Make predictions for a batch of inputs. This function may directly invoke `forward` (which passes the input through FFNN and returns the output logits)

        :param batch_inputs: a torch.Tensor object of size (n_examples, max_sent_length_in_this_batch), which is the *indexed* inputs
        :param batch_lengths: a torch.Tensor object of size (n_examples), which describes the actual sentence length of each example (i.e., before PAD)
        :return: a list of predicted classes for this batch of data, either 0 for negative class or 1 for positive class
        """
        # TODO: implement
        preds = None
        return preds
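
As a hint for how logits relate to class predictions, here is a minimal plain-Python sketch. It assumes one logit per example (as the `forward` docstring describes) and a decision threshold of 0.5 on the sigmoid output; the threshold and the helper name `logits_to_labels` are assumptions made for illustration, not requirements of the assignment.

```python
import math


def logits_to_labels(logits):
    """Map raw logits to 0/1 labels by thresholding sigmoid(logit) at 0.5."""
    labels = []
    for logit in logits:
        prob_positive = 1.0 / (1.0 + math.exp(-logit))
        labels.append(1 if prob_positive >= 0.5 else 0)
    return labels


print(logits_to_labels([-2.0, 0.3, 4.1]))  # [0, 1, 1]
```

Since sigmoid(z) >= 0.5 exactly when z >= 0, the same labels could be obtained by checking the sign of the logit directly.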

## Part 4: Sentiment classifier training and evaluation (15 Points)

In this part, we will start training the FFNN classifier and then evaluate it on the dev set.

<font color='blue'>YOUR TASK: Please read the following code and fill out the `TODO`s.</font>

We will first instantiate an FFNN classifier (a `FeedForwardNeuralNetClassifier` object) with embedding size 300 and hidden size 300:

In [None]:
# TODO: create the FFNN classifier
model = None

You can view the model architecture by "printing" the model:

In [None]:
model

Define the "device" (CPU or GPU) to run the model:

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
model = model.to(device)

If you are running the notebook on a GPU machine, you should see `cuda:0` printed. This means that a GPU device is available in your environment.

Next, define an Adam optimizer using `torch.optim.Adam`, setting learning rate to 0.001 and other configs by default.

In [None]:
# TODO: create the optimizer
optimizer = None

Before the training, we still need to set up the evaluation, such that we can monitor the model performance on the dev set:

In [None]:
def evaluate(classifier, exs: List[SentimentExample], return_metrics: bool=False):
    """
    Evaluates a given classifier on the given examples
    :param classifier: classifier to evaluate
    :param exs: the list of SentimentExamples to evaluate on
    :param return_metrics: set to True if returning the stats
    :return: (accuracy, precision, recall, F1) if `return_metrics` is True; otherwise None (the metrics are only printed)
    """
    all_labels = []
    all_preds = []

    eval_batch_iterator = SentimentExampleBatchIterator(exs, batch_size=32, PAD_idx=0, shuffle=False) # hard-coded batch size and PAD_idx
    eval_batch_iterator.refresh()
    batch_data = eval_batch_iterator.get_next_batch()
    while batch_data is not None:
        batch_inputs, batch_lengths, batch_labels = batch_data
        # project to device
        batch_inputs = batch_inputs.to(device)
        batch_lengths = batch_lengths.to(device)
        all_labels += batch_labels.tolist() # plain Python ints, matching List[int] in calculate_metrics

        preds = classifier.batch_predict(batch_inputs, batch_lengths=batch_lengths)
        all_preds += list(preds)
        batch_data = eval_batch_iterator.get_next_batch()

    if return_metrics:
        acc, prec, rec, f1 = calculate_metrics(all_labels, all_preds)
        return acc, prec, rec, f1
    else:
        calculate_metrics(all_labels, all_preds, print_only=True)


def calculate_metrics(golds: List[int], predictions: List[int], print_only: bool=False):
    """
    Calculate evaluation statistics comparing golds and predictions, each of which is a sequence of 0/1 labels.
    Returns accuracy, precision, recall, and F1.

    :param golds: gold labels
    :param predictions: pred labels
    :param print_only: set to True if printing the stats without returns
    :return: accuracy, precision, recall, and F1 (all floating numbers), or None (when print_only is True)
    """
    num_correct = 0
    num_pos_correct = 0
    num_pred = 0
    num_gold = 0
    num_total = 0
    if len(golds) != len(predictions):
        raise Exception("Mismatched gold/pred lengths: %i / %i" % (len(golds), len(predictions)))
    for idx in range(0, len(golds)):
        gold = golds[idx]
        prediction = predictions[idx]
        if prediction == gold:
            num_correct += 1
        if prediction == 1:
            num_pred += 1
        if gold == 1:
            num_gold += 1
        if prediction == 1 and gold == 1:
            num_pos_correct += 1
        num_total += 1
    acc = float(num_correct) / num_total
    prec = float(num_pos_correct) / num_pred if num_pred > 0 else 0.0
    rec = float(num_pos_correct) / num_gold if num_gold > 0 else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec > 0 and rec > 0 else 0.0

    print("Accuracy: %i / %i = %f" % (num_correct, num_total, acc))
    print("Precision (fraction of predicted positives that are correct): %i / %i = %f" % (num_pos_correct, num_pred, prec)
          + "; Recall (fraction of true positives predicted correctly): %i / %i = %f" % (num_pos_correct, num_gold, rec)
          + "; F1 (harmonic mean of precision and recall): %f" % f1)

    if not print_only:
        return acc, prec, rec, f1
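
To see how the four metrics relate, here is a small worked example on hypothetical gold labels and predictions (the lists are made up for illustration). It follows the same definitions as `calculate_metrics`, without the zero-division guards.

```python
golds = [1, 0, 1, 1, 0]
preds = [1, 1, 0, 1, 0]

true_pos = sum(1 for g, p in zip(golds, preds) if g == 1 and p == 1)
num_pred_pos = sum(preds)  # how many examples were predicted positive
num_gold_pos = sum(golds)  # how many examples are actually positive

accuracy = sum(1 for g, p in zip(golds, preds) if g == p) / len(golds)
precision = true_pos / num_pred_pos
recall = true_pos / num_gold_pos
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Here 3 of 5 predictions are correct (accuracy 0.6), and 2 of the 3 predicted positives and 2 of the 3 gold positives match, so precision, recall, and F1 all come out to 2/3.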

Now, complete the following code for training the model for 10 epochs with batch size 32. At the end of each epoch we also evaluate the model on the held-out dev set.

Note: Your training should finish within a reasonable time period (say, up to 10 minutes) even when using CPU. The exact time varies case by case. However, if your training takes much longer, it may be that 1) your implementation is buggy or 2) your implementation can be significantly optimized.

In [None]:
import time

BATCH_SIZE=32
N_EPOCHS=10

# create a batch iterator for the training data
batch_iterator = SentimentExampleBatchIterator(
    train_exs, batch_size=BATCH_SIZE, PAD_idx=PAD_IDX, shuffle=True)

# training
best_epoch = -1
best_acc = -1
start_time = time.time()
for epoch in range(N_EPOCHS):
    print("Epoch %i" % epoch)

    batch_iterator.refresh() # initiate a new iterator for this epoch

    model.train() # turn on the "training mode"
    batch_loss = 0.0
    batch_example_count = 0
    batch_data = batch_iterator.get_next_batch()
    while batch_data is not None:
        batch_inputs, batch_lengths, batch_labels = batch_data
        # project to the device
        batch_inputs = batch_inputs.to(device)
        batch_lengths = batch_lengths.to(device)
        batch_labels = batch_labels.to(device)

        # TODO: clean up the gradients for this batch

        # TODO: call the model and get the loss
        loss = None

        # record the loss and number of examples, so we could report some stats
        batch_example_count += len(batch_labels)
        batch_loss += loss.item() * len(batch_labels)

        # TODO: backpropagation

        # get another batch
        batch_data = batch_iterator.get_next_batch()

    print("Avg loss: %.5f" % (batch_loss / batch_example_count))

    # evaluate on dev set
    model.eval() # turn on the "evaluation mode"
    acc, _, _, _ = evaluate(model, dev_exs, return_metrics=True)
    if acc > best_acc:
        best_acc = acc
        best_epoch = epoch
        print("Secure a new best accuracy %.3f in epoch %d!" % (best_acc, best_epoch))

        # Save the current best model parameters
        print("Save the best model checkpoint as `best_model.ckpt`!")
        torch.save(model.state_dict(), "best_model.ckpt")

    print("Time elapsed: %s" % time.strftime("%Hh%Mm%Ss", time.gmtime(time.time()-start_time)))
    print("-" * 10)

print("End of training! The best accuracy %.3f was obtained in epoch %d." % (best_acc, best_epoch))
# Load back the best checkpoint on dev set
model.load_state_dict(torch.load("best_model.ckpt", weights_only=True))

**If your code runs well up to this point -- Congrats! You've completed the model training.**

The following code will evaluate your model on the blind test set, with results saved in the folder `data_to_submit`.

In [None]:
import os
path = os.path.join(DATA_PATH, "data_to_submit")
if not os.path.exists(path):
    os.mkdir(path)

all_preds = [] # save the prediction results

# iterator to load the test set
eval_batch_iterator = SentimentExampleBatchIterator(test_exs_words_only, batch_size=32, PAD_idx=PAD_IDX, shuffle=False)
eval_batch_iterator.refresh()
batch_data = eval_batch_iterator.get_next_batch()
while batch_data is not None:
    batch_inputs, batch_lengths, _ = batch_data
    # project to device
    batch_inputs = batch_inputs.to(device)
    batch_lengths = batch_lengths.to(device)

    preds = model.batch_predict(batch_inputs, batch_lengths=batch_lengths) # the `preds` should be a list of prediction labels
    all_preds += preds # accumulate the labels
    batch_data = eval_batch_iterator.get_next_batch()

# write the predicted labels along with their original sentences
test_output_path = os.path.join(DATA_PATH, "data_to_submit/test-blind.output.txt")
test_exs_predicted = [SentimentExample(ex.words, all_preds[ex_idx]) for ex_idx, ex in enumerate(test_exs_words_only)]
write_sentiment_examples(test_exs_predicted, test_output_path)

You've completed Part 4! Please don't forget to submit the `test-blind.output.txt` file along with your code to Blackboard!

## Part 5: Exploration of FFNN hyper-parameters (20 Points)

Q5.1: The previous part has implemented the FFNN with both the embedding size and the hidden size as 300. Does it make any difference with a smaller embedding or hidden size?

<font color='blue'>YOUR TASK: Please change `emb_dim` and `n_hidden_units` to other numbers, re-do all the experiment steps (**make sure to create a new model and a new optimizer for every experiment**), and report the results and your findings.</font>

Here, you will report, for each configuration, the model accuracy on the dev set and the epoch at which it achieves the best accuracy (which lets you keep track of the model's convergence speed). For a fair comparison, all experiments here should use the same batch size 32 and max number of epochs 10.


To report your results, please fill out the table in the PDF.
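
One possible way to keep the experiments organized is to enumerate the configurations up front and record one row per run. The sketch below only builds the bookkeeping; the candidate sizes are hypothetical, and the result fields are deliberately left empty for you to fill in from your own runs.

```python
from itertools import product

# Hypothetical candidate sizes; pick whatever values you want to compare against 300.
emb_dims = [50, 300]
hidden_sizes = [50, 300]

results = []
for emb_dim, n_hidden_units in product(emb_dims, hidden_sizes):
    # For each configuration, build a fresh model and a fresh optimizer here,
    # rerun training, and then fill in the two empty fields below.
    results.append({
        "emb_dim": emb_dim,
        "n_hidden_units": n_hidden_units,
        "best_dev_acc": None,  # to be filled in from your own run
        "best_epoch": None,    # to be filled in from your own run
    })

for row in results:
    print(row)
```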

Q5.2: Similarly, conduct experiments with different learning rates (but still use the Adam optimizer). Same as before, for a fair comparison, all experiments in this table should adopt the same configurations (other than the learning rate). To fill out this table, you can use emb_dim = n_hidden_units = 300, batch_size = 32, and max number of epochs = 10 as before.

<font color='blue'>YOUR TASK: Please change the Adam learning rate to other numbers, re-do all the experiment steps (**again, make sure you create a new model and a new optimizer**), and report the results and your findings.</font>

To report your results, please fill out the table in the PDF.


Q5.3: What do you observe from these two experiments, and why do you think this happens?

<font color='blue'>YOUR TASK: Describe your answer in the PDF.</font>

## Part 6: Exploration of the Pre-trained GloVe Word Embedding (15 Points)

The previous parts initialized the word embedding matrix randomly. Does it make any difference if we use pre-trained word embeddings such as GloVe? In the last part of this assignment, we will explore the use of GloVe.

First, download the GloVe embeddings from https://nlp.stanford.edu/projects/glove/ and save them under the assignment folder. Choose the version `Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download): glove.42B.300d.zip`. The download is large, so this will take a while.

**Skip the following code if you have already downloaded the file manually.**

In [None]:
!wget https://nlp.stanford.edu/data/glove.42B.300d.zip # do not run this more than once

Then extract the file `glove.42B.300d.txt` from the zip archive. Each line of the txt file contains a word followed by its embedding values, separated by spaces.

**Skip the following code if you have already extracted the file manually.**

In [None]:
!unzip -o glove.42B.300d.zip

The following function can help read in the vectors.

In [None]:
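import numpy as np

def read_glove_pretrained_embeddings(path_to_glove_txt: str, vocab: set):
    """Return a dict mapping each word in `vocab` that GloVe covers to its vector (a NumPy array)."""
    word2vec = {}
    with open(path_to_glove_txt, "r", encoding="utf-8") as f:
        for line in f: # stream line by line instead of loading the whole multi-GB file into memory
            line = line.strip()
            space_idx = line.find(' ') # the word ends at the first space; the rest are the vector values
            word = line[:space_idx]
            if word in vocab: # only words in the vocab will be considered
                vec = np.array(line[space_idx+1:].split()).astype(float)
                word2vec[word] = vec

    return word2vec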
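
To make the file format concrete, here is a small sketch that writes a made-up three-line "GloVe" file with 3-d vectors (the real file has 300 values per line) and parses it with the same word-then-numbers logic. The toy words, values, and the names `toy_lines`, `toy_vocab`, and `toy_word2vec` are all invented for illustration.

```python
import os
import tempfile

import numpy as np

# Made-up 3-d vectors in the "word v1 v2 v3" layout.
toy_lines = ["the 0.1 0.2 0.3", "movie -0.5 0.0 0.5", "zebra 1.0 1.0 1.0"]
toy_vocab = {"the", "movie", "great"}  # "great" has no line in the toy file

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_glove.txt")
    with open(toy_path, "w", encoding="utf-8") as f:
        f.write("\n".join(toy_lines) + "\n")

    toy_word2vec = {}
    with open(toy_path, "r", encoding="utf-8") as f:
        for line in f:
            word, *values = line.strip().split(" ")
            if word in toy_vocab:  # skip words we do not need ("zebra")
                toy_word2vec[word] = np.array([float(v) for v in values])

print(sorted(toy_word2vec))  # ['movie', 'the']
print(toy_word2vec["movie"].shape)  # (3,)
```

Words in the vocabulary that the file does not cover (here, "great") are simply absent from the returned dictionary, so the caller has to decide how to handle them.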

While the entire GloVe file is very large (containing embeddings for 1.9M words), we do not need all of them. Rather, we only need embeddings for words appearing in our sentiment classification dataset (the others won't be needed anyway, right? :)).

In Part 2 when we created the vocabulary, we included only words with more than 2 occurrences, because a neural net is not likely to learn anything useful when a word appears only once in the training corpus. However, with GloVe, we want to reconsider those infrequent words because GloVe likely has learned about their semantics from its pre-training corpus.

Therefore, our second step is to re-create the vocabulary, covering all words in the training, dev, and test sets.

In [None]:
vocab_counter = Counter([word for ex in train_exs + dev_exs + test_exs_words_only for word in ex.words])
vocab = [word for word, _ in vocab_counter.most_common()]
vocab = ["PAD", "UNK"] + vocab
print("Vocab size:", len(vocab))
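
A toy run of the same recipe shows what the resulting list looks like: `PAD` and `UNK` take indices 0 and 1, followed by the words from most to least frequent. The sentences and the `toy_word2idx` mapping below are hypothetical and only meant to illustrate how a position in the list becomes a word index.

```python
from collections import Counter

# Made-up tokenized sentences with distinct word counts: a=4, good=3, movie=2, bad=1.
toy_sentences = [
    ["a", "good", "movie"],
    ["a", "good", "movie"],
    ["a", "good", "bad"],
    ["a"],
]
toy_counter = Counter(word for sent in toy_sentences for word in sent)
toy_vocab = ["PAD", "UNK"] + [word for word, _ in toy_counter.most_common()]

# Hypothetical lookup table: a word's index is its position in the vocab list.
toy_word2idx = {word: idx for idx, word in enumerate(toy_vocab)}

print(toy_vocab)  # ['PAD', 'UNK', 'a', 'good', 'movie', 'bad']
print(toy_word2idx["movie"])  # 4
```

The order matters later: row `i` of the embedding matrix has to hold the vector of the word at position `i` in the vocabulary.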

Now, we will write code to extract a subset of the word embeddings for words covered in this new vocabulary.

In [None]:
glove_word2vec = read_glove_pretrained_embeddings("glove.42B.300d.txt", set(vocab))

glove_word_embeddings = []
for word in vocab:
    if word in glove_word2vec:
        glove_word_embeddings.append(glove_word2vec[word])
    else:
        glove_word_embeddings.append(np.zeros(300, dtype=float)) # zero vectors for PAD/UNK/words not covered by glove

glove_word_embeddings = torch.Tensor(np.array(glove_word_embeddings)).to(device)
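
It can be useful to check how many vocabulary words actually received a GloVe vector, since every uncovered word starts from a zero vector. The sketch below shows one way to do this on made-up data; `toy_vocab`, `toy_glove`, and the 3-d vectors are illustrative assumptions, not values from the real dataset.

```python
import numpy as np

toy_vocab = ["PAD", "UNK", "movie", "great", "qwertyish"]
toy_glove = {"movie": np.ones(3), "great": np.full(3, 2.0)}  # made-up 3-d vectors

# Same fallback as above: a zero vector for anything without a pre-trained entry.
rows = [toy_glove.get(word, np.zeros(3)) for word in toy_vocab]
toy_matrix = np.array(rows)

missing = [word for word in toy_vocab if word not in toy_glove]
n_covered = len(toy_vocab) - len(missing)

print(toy_matrix.shape)  # (5, 3): one row per vocab entry, in vocab order
print(missing)  # ['PAD', 'UNK', 'qwertyish']
print(f"covered {n_covered} of {len(toy_vocab)} vocab entries")
```

On the real data, the analogous check would compare `len(glove_word2vec)` with `len(vocab)`.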

<font color='blue'>YOUR TASK: Please read the following code and fill out the `TODO` for initializing FFNN with `glove_word_embeddings`. Keep the embedding parameters tunable.</font>

Hint: You can use `nn.Embedding.from_pretrained` from PyTorch.
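
To see what `nn.Embedding.from_pretrained` does before using it in the class, here is a toy demonstration on a made-up 3x2 matrix. Its `freeze` argument defaults to `True`, which makes the weights non-trainable; passing `freeze=False` keeps them tunable. The names `toy_vectors`, `frozen`, and `tunable` are illustrative only.

```python
import torch
from torch import nn

# Made-up "pre-trained" matrix: 3 words, 2 dimensions each.
toy_vectors = torch.tensor([[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]])

frozen = nn.Embedding.from_pretrained(toy_vectors)  # freeze=True by default
tunable = nn.Embedding.from_pretrained(toy_vectors, freeze=False)

print(frozen.weight.requires_grad)  # False
print(tunable.weight.requires_grad)  # True

# Looking up word index 2 returns the corresponding row of the matrix.
print(tunable(torch.tensor([2])))
```

Only parameters with `requires_grad=True` receive gradients, so a frozen embedding layer would stay at its GloVe values throughout training.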

In [None]:
class FeedForwardNeuralNetClassifierwGlove(FeedForwardNeuralNetClassifier):
    def __init__(self, vocab_size, emb_dim, n_hidden_units):
        super().__init__(vocab_size, emb_dim, n_hidden_units)

        # TODO: implement the use of pre-trained word embeddings
        self.word_embeddings = None

The above code created a child class of `FeedForwardNeuralNetClassifier` with the only difference lying in the initialization of the word embeddings. To run experiments using this new classifier, we re-initialize the `model` as an instance of this new classifier:

In [None]:
model = FeedForwardNeuralNetClassifierwGlove(vocab_size=len(vocab), emb_dim=300, n_hidden_units=300)

Then, please rerun the steps in Part 4 for model training (learning rate 0.001) and evaluation. Note that this training will take longer (about 3-4 minutes in my runs) because the embedding matrix is now much larger than before. If you have not optimized your model implementation code, it is better to do so before this training.

Report the new accuracy and the epoch giving the best accuracy in the PDF.

What do you observe, and why might it happen?

<font color='blue'>Describe your findings and give your answer in the PDF.</font>

## You have completed this assignment!