# Assignment 2: RNNs for Sentiment Analysis

## Dataset

The IMDB dataset consists of 25K labeled movie reviews for training (12.5K positive and 12.5K negative) and a further 25K labeled reviews for testing (again 12.5K positive and 12.5K negative), for a total of 50K labeled reviews, all sourced from [IMDB](https://www.imdb.com/). The HuggingFace version also ships 50K unlabeled reviews in an `unsupervised` split, which we do not use here. The training and testing sets are disjoint.

It was produced by NLP researchers at Stanford, the goal being to train intelligent text processing systems to classify the reviews as positive or negative (sometimes referred to as "[sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis)"), one of the earliest and best-studied NLP tasks.

The dataset and its metadata can be accessed via HuggingFace Datasets [here](https://huggingface.co/datasets/stanfordnlp/imdb).

In [1]:
!pip install datasets


In [2]:
from datasets import load_dataset
imdb = load_dataset("stanfordnlp/imdb")

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

The `IMDBDataset` class below is a wrapper that subclasses [torch.utils.data.Dataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset). Doing so lets us later wrap `IMDBDataset` in a `DataLoader`, which provides useful functionality for quickly loading batches of data.

In [3]:
from torch.utils.data import Dataset

class IMDBDataset(Dataset):

    # any subclass of Dataset must have __init__(self, ...)
    def __init__(self, split='train'):

        # inherit all functionality from Dataset
        super().__init__()

        # check that there is one label for every review
        assert len(imdb[split]['text']) == len(imdb[split]['label'])

        # data structure of our choosing
        self.pairs = list(zip(imdb[split]['text'], imdb[split]['label']))

    # any subclass of Dataset must have __len__(self)
    # x.__len__() is the same as len(x)
    def __len__(self):
        return len(self.pairs)

    # any subclass of Dataset must have __getitem__(self, idx)
    # x.__getitem__(i) is the same as x[i]
    # be careful when writing this method:
    # be sure it accommodates slices! x[i:j]
    def __getitem__(self, idx):
        return self.pairs[idx]
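
To see what wrapping a `Dataset` in a `DataLoader` provides, here is a minimal sketch. `ToyPairs` and its four reviews are hypothetical stand-ins (not IMDB data), and the sketch assumes PyTorch's default collation, which groups the texts of a batch together and stacks the integer labels into a tensor.

```python
from torch.utils.data import Dataset, DataLoader

class ToyPairs(Dataset):
    """Hypothetical stand-in for IMDBDataset holding (text, label) pairs."""

    def __init__(self):
        super().__init__()
        self.pairs = [
            ("great film", 1),
            ("dull plot", 0),
            ("loved it", 1),
            ("fell asleep", 0),
        ]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        # list indexing handles both integers and slices
        return self.pairs[idx]

toy = ToyPairs()
print(len(toy))   # number of pairs
print(toy[1:3])   # slicing returns a plain list of pairs

# shuffle=False keeps the original order so the batches are predictable
loader = DataLoader(toy, batch_size=2, shuffle=False)
batches = [(list(texts), labels.tolist()) for texts, labels in loader]
print(batches)
```

Each batch arrives as a group of texts plus a tensor of labels, which is the shape of data a training loop would unpack.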

## Model

### Imports for RNN Architecture

- Neural network building blocks: [`torch.nn`](https://pytorch.org/docs/stable/nn.html)
- NN functions: [`torch.nn.functional`](https://pytorch.org/docs/stable/nn.functional.html)
- RNN - combine a batch of padded sequences into a single packed sequence for efficient processing: [`torch.nn.utils.rnn.pack_padded_sequence`](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html)
- RNN - un-combine packed sequences back into a padded batch: [`torch.nn.utils.rnn.pad_packed_sequence`](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_packed_sequence.html)
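
A small round trip may make the two packing utilities more concrete. This is a sketch on a made-up batch of three sequences with a single feature per time step; the values and lengths are illustrative assumptions only.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# toy batch: 3 sequences, padded with 0 to length 5, feature size 1
padded = torch.tensor([
    [1., 2., 3., 4., 5.],
    [6., 7., 0., 0., 0.],
    [8., 9., 10., 0., 0.],
]).unsqueeze(-1)                     # shape [3, 5, 1]
lengths = torch.tensor([5, 2, 3])    # true lengths, kept on the CPU

# enforce_sorted=False lets us pass sequences in any order
packed = pack_padded_sequence(
    padded, lengths, batch_first=True, enforce_sorted=False
)
# packed.data stores only the real time steps: 5 + 2 + 3 of them
print(packed.data.shape)

# unpacking restores the padded layout and the original batch order
unpacked, out_lengths = pad_packed_sequence(packed, batch_first=True)
print(torch.equal(unpacked, padded), out_lengths.tolist())
```

Because the packed form skips padding positions, an RNN fed a packed sequence never spends computation on padding tokens.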

In [4]:
import torch
from torch import nn
import torch.nn.functional as F
import torch.nn.utils.rnn as rnn_utils

### RNN architecture

Similar to the CNNs in the last assignment, we must:
- subclass `nn.Module`
- define an `__init__(self, ...)` method
- define a `forward(self, ...)` method

#### A brief note on why we need padding and truncation

Although RNNs can handle arbitrarily long sequences, as well as sequences of varying length, we cannot process a batch of sequences of different lengths all at once.

Consider a batch like:

$$ \begin{array}{l} [0, 4, 8, 9, 27] \\ [0, 2] \\ [0, 3, 5] \end{array} $$

Let's say we have processed the first two tokens in each sequence using linear algebra operations on the hardware. We still have to process:

$$ \begin{array}{l} [8, 9, 27] \\ [\,] \\ [5] \end{array} $$

A generic algorithm for handling matrices and vectors is going to throw an error when it sees that there are empty entries here. Padding fixes this: we append a special padding token to the shorter sequences until every sequence in the batch has the same length, so the batch forms a rectangular tensor.

Truncation complements padding by letting us fix a maximum length to pad sequences to, while also handling sequences longer than that maximum. We simply cut off such a sequence beyond the maximum length (a truncated sequence needs no padding).
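
As a rough illustration, the batch above can be made rectangular in plain Python. The helper `pad_or_truncate`, the padding id `1`, and the maximum length of 4 are all hypothetical choices made for this sketch; in the assignment itself the tokenizer performs this step.

```python
def pad_or_truncate(seq, max_length, pad_id):
    """Cut seq to max_length, or right-pad it with pad_id up to max_length."""
    clipped = seq[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))

PAD = 1  # hypothetical padding id, chosen because it is unused in the batch

batch = [[0, 4, 8, 9, 27], [0, 2], [0, 3, 5]]
rect = [pad_or_truncate(seq, max_length=4, pad_id=PAD) for seq in batch]

for row in rect:
    print(row)
# [0, 4, 8, 9]   <- truncated, no padding needed
# [0, 2, 1, 1]   <- padded with two PAD tokens
# [0, 3, 5, 1]   <- padded with one PAD token
```

Every row now has the same length, so the batch can be stored as a single 3 x 4 tensor.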

***TODO: Complete the `__init__()` and `forward()` methods of the following RNN class [10 points].***

In [5]:
class RNN(nn.Module):

    def __init__(
            self,
            vocab_size,
            embedding_dim,
            hidden_size,
            num_layers,
            dropout,
            bidirectional,
            padding_idx
        ):

        # inherit from nn.Module
        super().__init__()

        # we need these for the forward() method
        self.bidirectional = bidirectional
        self.hidden_size = hidden_size

        """EMBEDDING LAYER"""
        self.embedding = nn.Embedding(
            num_embeddings=vocab_size, # TODO: need one embedding for each vocab item
            embedding_dim=embedding_dim, # customizable
            padding_idx=padding_idx # makes sure we treat padding as padding
        )

        """RNN LAYERS (LSTM)"""
        self.lstm = nn.LSTM(
            input_size=embedding_dim, # TODO: input comes from the embedding layer
            hidden_size=hidden_size, # customizable
            num_layers=num_layers, # customizable
            batch_first=True, # generally the right way to order the dims
            dropout=dropout, # dropout applied to the outputs of every LSTM layer except the last
            bidirectional=bidirectional # choice, may help, may hurt
        )

        if not bidirectional: # concat last output w/ last hidden state
            self.linear = nn.Linear(2 * hidden_size, 1)
        else: # concat last outputs w/ last hidden states from both directions
            self.linear = nn.Linear(4 * hidden_size, 1)

        # quick way to count trainable params in the model
        p_count = 0
        for param in self.parameters():
            if param.requires_grad:
                p_count += param.numel()
        print(f'Model initialized with {p_count} trainable parameters.')

    def forward(self, x, lengths):
        #print('input', x.shape) # [batch_size, seq_length]

        """TODO: EMBED THE SEQUENCE"""
        x = self.embedding(x)

        #print('embed', x.shape) # [batch_size, seq_length, embedding_dim]

        """PACK THE SEQUENCE"""
        # see documentation to understand what this does
        x = rnn_utils.pack_padded_sequence(
            x,
            lengths,
            batch_first=True,
            enforce_sorted=False
        )
        # can't print shape here, because x is not a torch.Tensor anymore :/

        """TODO: PASS THE SEQUENCE THROUGH THE RNN"""
        x, (hidden, cell) = self.lstm(x)

        # x is still a PackedSequence here, so only hidden has a shape to print
        #print('h', hidden.shape) # [num_layers(*2), batch_size, hidden_size]

        """UNPACK THE SEQUENCE"""
        # this undoes the packing from a few lines ago
        x, _ = rnn_utils.pad_packed_sequence(x, batch_first=True)
        #print('x', x.shape) # [batch_size, max(lengths), hidden_size(*2)]

        if self.bidirectional:

            """GET THE LAST OUTPUT LEFT-TO-RIGHT, IGNORING PADDING"""
            # range(len(x)) = entire batch, lengths - 1 = end of sequences
            # ignoring padding, :self.hidden_size = LTR
            forward_last_output = x[range(len(x)), lengths - 1, :self.hidden_size]
            #print('fwo', forward_last_output.shape) # [batch_size, hidden_size]

            """GET THE LAST OUTPUT RIGHT-TO-LEFT, IGNORING PADDING"""
            # : = entire batch, 0 = start of sequence, self.hidden_size: = RTL
            backward_last_output = x[:, 0, self.hidden_size:]
            #print('bwo', backward_last_output.shape) # [batch_size, hidden_size]

            """GET THE LAST HIDDEN STATE LEFT-TO-RIGHT"""
            # second-to-last item in dim 0 will be the hidden state from
            # the full LTR pass
            forward_last_hidden = hidden[-2, :, :]
            #print('fwh', forward_last_hidden.shape) # [batch_size, hidden_size]

            """GET THE LAST HIDDEN STATE RIGHT-TO-LEFT"""
            # last item in dim 0 will be the hidden state from the full RTL pass
            backward_last_hidden = hidden[-1, :, :]
            #print('bwh', backward_last_hidden.shape) # [batch_size, hidden_size]

            """CONCATENATE THE LAST OUTPUTS AND HIDDEN STATES"""
            # concat allows us to use information from all four sources
            x = torch.cat((
                forward_last_output,
                backward_last_output,
                forward_last_hidden,
                backward_last_hidden
            ), dim=1)
            #print('cat', x.shape) # [batch_size, hidden_size*4]

        else:

            """GET THE LAST OUTPUT, IGNORING PADDING"""
            # range(len(x)) = entire batch, lengths - 1 = end of sequences
            # ignoring padding, :self.hidden_size = LTR
            last_output = x[range(len(x)), lengths - 1, :self.hidden_size]
            #print('uni out', last_output.shape) # [batch_size, hidden_size]

            """GET THE LAST HIDDEN STATE"""
            # last item in dim 0 will be the hidden state from the full LTR pass
            last_hidden = hidden[-1, :, :]
            #print('uni hidden', last_hidden.shape) # [batch_size, hidden_size]

            """CONCATENATE THE LAST OUTPUT AND HIDDEN STATE"""
            # concat allows us to use information from both sources
            x = torch.cat((
                last_output,
                last_hidden
            ), dim=1)
            #print('uni cat', x.shape) # [batch_size, hidden_size*2]

        """PASS THE OUTPUT THROUGH THE LINEAR LAYER"""
        # .squeeze(-1) removes only the trailing size-1 dimension,
        # so a batch containing a single example keeps shape [1]
        x = self.linear(x).squeeze(-1)
        #print('sys out', x.shape) # [batch_size]

        return x
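
The indexing that picks out the last real time step of each sequence can be checked on a small tensor. This is a sketch with made-up shapes and values (batch of 2, sequence length 4, hidden size 3), not output from the model.

```python
import torch

# stand-in for unpacked LSTM output: [batch_size=2, seq_length=4, hidden_size=3]
x = torch.arange(2 * 4 * 3).reshape(2, 4, 3)

# true lengths: the first sequence has 2 real tokens, the second has 4
lengths = torch.tensor([2, 4])

# pair each batch index with the index of its last real time step
last = x[range(len(x)), lengths - 1]
print(last.tolist())  # [[3, 4, 5], [21, 22, 23]]

# for comparison, taking the final position would read padding for sequence 0
naive = x[:, -1]
print(naive.tolist())  # [[9, 10, 11], [21, 22, 23]]
```

The first row of `last` comes from time step 1 rather than time step 3, which is the reason the model indexes with `lengths - 1` instead of `-1`.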

### Tokenizer

Training a tokenizer yourself is not too painful, but an even easier choice is to load up a pretrained one such as the one used by [BERT](https://bibbase.org/service/mendeley/bfbbf840-4c42-3914-a463-19024f50b30c/file/6375d223-e085-74b3-392f-f3fed829cd72/Devlin_et_al___2019___BERT_Pre_training_of_Deep_Bidirectional_Transform.pdf.pdf). If you're curious about training your own, look into [SentencePiece](https://github.com/google/sentencepiece).

Otherwise, let's use BERT's tokenizer, available on [HuggingFace](https://huggingface.co/google-bert/bert-base-uncased):

In [6]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

# take a look at how it splits up words
print('calling tokenizer.encode:')
print(tokenizer.encode('hello my name is slim shady'))

# tokenizer use, can also just say tokenizer(...) rather than tokenizer.__call__(...)
tokens = tokenizer.__call__(
    'hello my name is slim shady', # text to tokenize
    padding='max_length', # pad sequences to the max_length if necessary
    truncation=True, # truncate long sequences if necessary
    max_length=128, # truncate and/or pad to this length
    return_tensors='pt' # give torch.Tensor as the output of tokenizing
)['input_ids'] # the encoding holds two other entries as well, we just want the ids

# take a look at the input ids
print('calling the tokenizer with max length padding:')
print(tokens)

# how many tokens in the model's vocabulary?
print('tokenizer\'s vocab size:', tokenizer.vocab_size)

calling tokenizer.encode:
[101, 7592, 2026, 2171, 2003, 11754, 22824, 102]
calling the tokenizer with max length padding:
tensor([[  101,  7592,  2026,  2171,  2003, 11754, 22824,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
         



## Training and evaluation

### Imports for training and evaluation

- [`torch.optim.Adam`](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html)
- [`torch.nn.BCEWithLogitsLoss`](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html)
- [`torch.utils.data.DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)
- [`time.time`](https://docs.python.org/3/library/time.html#time.time)
- [`sklearn.metrics.classification_report`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

In [7]:
from torch.optim import Adam # optimization algorithm
from torch.nn import BCEWithLogitsLoss # loss function
from torch.utils.data import DataLoader # iterates over dataset in batches
from time import time # time things
from sklearn.metrics import classification_report # evaluation metric

### Hyperparameters

Hyperparameters are settings chosen by the programmer before training, as opposed to parameters, which are optimized by the learning algorithm.

Some hyperparameters we are setting:
- Vocabulary size, which we will just set to be the same as BERT's for convenience
- Embedding size, a choice. A larger embedding is slower and harder to train and more prone to overfitting, but may perform better.
- Hidden size, also a choice with the same trade-offs. Here we set it smaller than the embedding size.
- Number of layers, also a choice with the same trade-offs.
- Dropout - a little dropout often helps regularize a model and reduce overfitting.
- Bidirectional vs. unidirectional - another trade-off. A bidirectional model is slower and harder to train and more prone to overfitting, but may perform better.
- Padding index - the input id of the padding token, which should be ignored.
- Batch size - how many inputs to process in parallel, and also how many inputs to process before updating the model's parameters. A smaller batch size uses less memory, but each update relies on just a few examples, so the gradient estimates are noisier.
- Epochs - how many times the model sees the training set. More epochs usually fit the training data better, but too many can cause overfitting.
- Maximum length - the longest sequence length we will process. A longer sequence takes longer to process, but captures more of the information in the sequence. For an RNN, longer sequences can also hurt performance, because RNNs may "forget" information as they move along the sequence.
- Optimization algorithm - Adam is a popular choice, but other choices may do better. Other common choices are [Stochastic Gradient Descent (SGD)](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html) and [AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html#torch.optim.AdamW).
- Learning rate - THE MOST IMPORTANT HYPERPARAMETER - this has been [shown empirically](https://proceedings.mlr.press/v28/bergstra13.html). Typical choices are orders of magnitude smaller than 1 (1e-2, 1e-3, 1e-4...).
- Loss function - kind of a hyperparameter, but really, it determines what you are teaching the model. Binary cross-entropy (BCE) loss is appropriate for binary classification. `BCEWithLogitsLoss` applies a sigmoid to the raw model outputs, so training pushes outputs for the positive class above 0 (probability above 0.5) and outputs for the negative class below 0.
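
To make the link between raw outputs and binary cross-entropy concrete, here is a plain-Python sketch of the loss for a single example. The helper names `sigmoid` and `bce_with_logits` are hypothetical; PyTorch's `BCEWithLogitsLoss` computes the same quantity in a numerically more stable way and, by default, averages it over the batch.

```python
import math

def sigmoid(z):
    """Map a raw model output (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logits(logit, label):
    """Binary cross-entropy for one logit and one 0/1 label."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# an output of exactly 0 means probability 0.5, i.e. a coin flip
chance = bce_with_logits(0.0, 1)
print(round(chance, 4))  # 0.6931, which is ln 2

# a confidently correct output is cheap, a confidently wrong one is expensive
confident_right = bce_with_logits(3.0, 1)
confident_wrong = bce_with_logits(3.0, 0)
print(confident_right < chance < confident_wrong)  # True
```

A useful consequence: a training loss hovering around 0.693 suggests the model's outputs are still close to 0, so it is doing no better than chance.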

***TODO: Adjust the learning rate and number of epochs to train the model to at least 75% accuracy [10 points]. NOTE: You should do this part last.***

In [9]:
# check if GPU is available, and let device be 'cuda' if so, otherwise 'cpu'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# our model
rnn = RNN(
    vocab_size=tokenizer.vocab_size, # how many embeddings we need
    embedding_dim=256, # number of dimensions in an embedding vector
    hidden_size=128, # number of dimensions in each hidden layer
    num_layers=4, # number of RNN layers
    dropout=0.1, # dropout probability in each RNN layer
    bidirectional=True, # if False, only read left-to-right
    padding_idx=tokenizer.pad_token_id # ignore padding tokens in the input
).to(device) # move model onto GPU if available

# number of examples to process in parallel
# we will perform only one optimization step per batch
batch_size = 120

# number of times the model sees the training data
epochs = 5 # TODO: select a better number of epochs

# maximum length of sequences we will process
# longer sequences will be truncated to this length
# shorter sequences will be padded to this length
max_length = 256

# Adam is the most beloved optimizer, 'lr' is the learning rate
optimizer = Adam(rnn.parameters(), lr=1e-4) # TODO: select a better learning rate

# loss function for binary classification
criterion = BCEWithLogitsLoss()

Model initialized with 9395201 trainable parameters.
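
The reported parameter count can be reproduced by hand. The sketch below assumes a vocabulary size of 30522 for `bert-base-uncased` (the value is not visible in the output above) and uses the fact that each LSTM direction in PyTorch has four gates, each with input weights, recurrent weights, and two bias vectors.

```python
vocab_size = 30522   # assumed vocabulary size of bert-base-uncased
embedding_dim = 256
hidden_size = 128
num_layers = 4
directions = 2       # bidirectional=True

# one embedding vector per vocabulary item
embedding = vocab_size * embedding_dim

def lstm_direction_params(input_size, hidden_size):
    # 4 gates, each with input weights, recurrent weights and two biases
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

lstm = 0
for layer in range(num_layers):
    # layer 0 reads embeddings; deeper layers read both directions of the layer below
    input_size = embedding_dim if layer == 0 else directions * hidden_size
    lstm += directions * lstm_direction_params(input_size, hidden_size)

# the linear layer maps 4 * hidden_size features to 1 output, plus one bias
linear = 4 * hidden_size + 1

total = embedding + lstm + linear
print(embedding, lstm, linear, total)  # 7813632 1581056 513 9395201
```

Most of the parameters sit in the embedding table, so shrinking `embedding_dim` is the quickest way to make the model smaller.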


### Testing a forward pass to make sure shapes are compatible

We are picking a random tensor for both `x` and `lengths` to make sure that the model functions correctly.

While these `x` and `lengths` tensors are randomly generated, they will "look" like real inputs.

For clarity:
- `x`'s entries range between 0 and the vocab size, which is the valid range of input token IDs for the model.
- `x` has shape `(batch_size, max_length)`, simulating a batch of `batch_size` sequences padded/truncated to length `max_length`.
- `x` has datatype `torch.int64`, which is the kind of tensor our tokenizer will produce.
- `x` is on the GPU if it's available.
- `lengths` has entries ranging between 1 and `max_length`, simulating possible "real" sequence lengths, ignoring padding.
- `lengths` has shape `(batch_size,)`, simulating a 1D array of "true" lengths (ignoring padding), one "true" length per item in the batch.
- `lengths` has datatype `torch.int64`, which is the kind of tensor the RNN utilities expect.
- `lengths` is on the CPU, also because it's what the RNN utilities expect.

In [10]:
# test the model's forward() method
rnn(
    x = torch.randint( # random valid x
        0,
        tokenizer.vocab_size,
        (batch_size, max_length),
        dtype=torch.int64,
        device=device
    ),
    lengths = torch.randint( # random valid lengths
        1,
        max_length + 1, # upper bound is exclusive, so this allows max_length itself
        (batch_size,),
        dtype=torch.int64,
        device='cpu'
    )
)

tensor([0.0637, 0.0510, 0.0566, 0.0468, 0.0575, 0.0526, 0.0554, 0.0607, 0.0506,
        0.0627, 0.0522, 0.0565, 0.0563, 0.0590, 0.0587, 0.0598, 0.0535, 0.0611,
        0.0577, 0.0596, 0.0618, 0.0548, 0.0576, 0.0612, 0.0524, 0.0525, 0.0626,
        0.0561, 0.0546, 0.0609, 0.0542, 0.0561, 0.0547, 0.0612, 0.0501, 0.0582,
        0.0605, 0.0571, 0.0597, 0.0550, 0.0588, 0.0567, 0.0592, 0.0682, 0.0626,
        0.0507, 0.0510, 0.0541, 0.0598, 0.0514, 0.0537, 0.0583, 0.0631, 0.0617,
        0.0519, 0.0572, 0.0542, 0.0599, 0.0548, 0.0704, 0.0587, 0.0528, 0.0596,
        0.0571, 0.0610, 0.0592, 0.0523, 0.0541, 0.0473, 0.0572, 0.0497, 0.0583,
        0.0582, 0.0579, 0.0566, 0.0540, 0.0656, 0.0588, 0.0636, 0.0523, 0.0593,
        0.0553, 0.0589, 0.0556, 0.0557, 0.0634, 0.0551, 0.0560, 0.0602, 0.0564,
        0.0544, 0.0644, 0.0532, 0.0598, 0.0537, 0.0599, 0.0594, 0.0508, 0.0580,
        0.0583, 0.0533, 0.0657, 0.0558, 0.0531, 0.0573, 0.0582, 0.0547, 0.0568,
        0.0603, 0.0583, 0.0567, 0.0526, 

## Dataloaders

A [`torch.utils.data.DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) is a convenient wrapper for a dataset. You can iterate over it to access batches of data.
- `dataset` is the dataset you want to wrap.
- `batch_size` is the batch size.
- `shuffle=True` will randomly shuffle the data each time you iterate over the `DataLoader`.
- `num_workers` will spawn additional processes on CPU to load up batches more efficiently.

### A brief note on data size

You may find that it is very difficult to get the model to learn the entire training set, as it's very large. Consider starting with a small `NUM_SAMPLES` and increasing it as you experiment. Try to get the model to learn a few tens of thousands of training examples, and evaluate on the same amount of testing data.

In [11]:
OVERFIT = False # test on the train set to check whether the model is learning
NUM_SAMPLES = 10_000 # number of train samples, also number of test samples
PER_CLASS = NUM_SAMPLES // 2

# build the wrapper once; slicing it returns a plain list of (text, label) pairs
train_data = IMDBDataset('train')

# taking the first and last PER_CLASS items balances the two classes,
# assuming each split is ordered by label
train_set = DataLoader(
    dataset=train_data[:PER_CLASS] + train_data[-PER_CLASS:],
    batch_size=batch_size,
    shuffle=True,
    num_workers=1
)

if OVERFIT: # test on the train set to check whether the model is learning
    test_set = train_set
else:
    test_data = IMDBDataset('test')
    test_set = DataLoader(
        dataset=test_data[:PER_CLASS] + test_data[-PER_CLASS:],
        batch_size=batch_size,
        shuffle=True,
        num_workers=1
    )

### Training loop with evaluation at end of each epoch



***TODO: Complete the training [20 points] and evaluation [10 points] loop below. Comments have been left to help you.***

Note: computing the lengths is tricky. The idea is to count how many tokens in each sequence are NOT padding tokens. For instance, if 0 is our padding index, we might have:

`inputs = [ [1, 2, 0, 0], [5, 8, 3, 0] ]`

Then `lengths = [2, 3]`.
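
The example above can be checked directly in PyTorch. This sketch assumes, as in the example, that 0 is the padding index and that no real token shares that id.

```python
import torch

pad_id = 0  # padding index assumed in the example above
inputs = torch.tensor([[1, 2, 0, 0], [5, 8, 3, 0]])

# True wherever a position holds a real token rather than padding
mask = inputs != pad_id

# summing the mask along the sequence dimension counts the real tokens
lengths = mask.sum(dim=1)
print(lengths.tolist())  # [2, 3]
```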

In [12]:
# track loss
train_losses = []

# track time
times = []

# after log_freq batches, print loss and time
log_freq = 20

for e in range(epochs): # epoch loop

    print(f'Epoch {e+1}:')

    # iterate over training data in batches
    for idx, batch in enumerate(train_set):

        start = time()

        # TODO: unpack batch
        texts, labels = batch


        # TODO: tokenize entire batch
        tokens = tokenizer(
            texts,
            padding='max_length',
            truncation=True,
            max_length=max_length,
            return_tensors='pt'
        )['input_ids'].to(device)
        labels = labels.to(device)

        # TODO: determine "true" lengths, ignoring padding
        pad_token_id = tokenizer.pad_token_id
        lengths = (tokens != pad_token_id).sum(dim=1).to('cpu')

        # print(tokens.shape, labels.shape)

        # TODO: reset any gradients from previous batch to 0
        optimizer.zero_grad()

        # TODO: pass through RNN - ensure lengths is on CPU
        outputs = rnn(tokens, lengths)

        # print(outputs.shape, labels.shape)
        # print(outputs.dtype, labels.dtype)

        # TODO: compute the loss between the outputs and the labels
        # labels must be floats and on the same device as the outputs
        labels = labels.float().to(device)
        loss = criterion(outputs, labels)

        # TODO: backpropagation
        loss.backward()


        # TODO: update model parameters
        optimizer.step()


        # track loss
        train_losses.append(loss.item())

        # track time
        times.append(time() - start)

        # track average loss and time
        avg_loss = sum(train_losses) / len(train_losses)
        avg_time = sum(times) / len(times)

        # log message
        if (idx+1) % log_freq == 0:
            msg =  f'\tBatch: {idx+1:04}, '
            msg += f'Loss: {train_losses[-1]:.4f}, Avg Loss: {avg_loss:.4f}, '
            msg += f'Time: {times[-1]:.4f}, Avg Time: {avg_time:.4f}'
            print(msg)

    # predictions over test set
    test_preds = []

    # true labels over test set
    test_labels = []

    # switch to evaluation mode so that dropout is disabled
    rnn.eval()

    # turn off gradient computation to save time - we are not updating here
    with torch.no_grad():

        print(f'Evaluating after epoch {e+1}:')

        # iterate over test set
        for idx, batch in enumerate(test_set):

            # TODO: unpack batch
            texts, labels = batch

            # TODO: tokenize entire batch
            input_ids = tokenizer(
                texts,
                padding='max_length',
                truncation=True,
                max_length=max_length,
                return_tensors='pt'
            )['input_ids'].to(device)

            # TODO: determine "true" lengths, ignoring padding
            pad_token_id = tokenizer.pad_token_id
            lengths = (input_ids != pad_token_id).sum(dim=1).to('cpu')

            # TODO: pass through RNN - ensure lengths is on CPU
            outputs = rnn(input_ids, lengths)

            # make predictions based on model outputs:
            # BCE trained positive outputs to mean label 1, all others label 0
            test_preds.extend((outputs > 0).long().reshape(-1).tolist())

            # store the corresponding labels
            test_labels.extend(labels.reshape(-1).tolist())

    # switch back to training mode for the next epoch
    rnn.train()

    # evaluation
    print(classification_report(test_labels, test_preds))

Epoch 1:
	Batch: 0020, Loss: 0.6918, Avg Loss: 0.6929, Time: 0.2570, Avg Time: 0.2946
	Batch: 0040, Loss: 0.6925, Avg Loss: 0.6932, Time: 0.2612, Avg Time: 0.2904
	Batch: 0060, Loss: 0.6935, Avg Loss: 0.6931, Time: 0.2523, Avg Time: 0.2799
	Batch: 0080, Loss: 0.6882, Avg Loss: 0.6926, Time: 0.3212, Avg Time: 0.2819
Evaluating after epoch 1:
              precision    recall  f1-score   support

           0       0.51      0.98      0.67      5000
           1       0.71      0.05      0.10      5000

    accuracy                           0.52     10000
   macro avg       0.61      0.52      0.38     10000
weighted avg       0.61      0.52      0.38     10000

Epoch 2:
	Batch: 0020, Loss: 0.6819, Avg Loss: 0.6918, Time: 0.2456, Avg Time: 0.2751
	Batch: 0040, Loss: 0.6750, Avg Loss: 0.6900, Time: 0.3601, Avg Time: 0.2759
	Batch: 0060, Loss: 0.6796, Avg Loss: 0.6862, Time: 0.2703, Avg Time: 0.2751
	Batch: 0080, Loss: 0.6125, Avg Loss: 0.6800, Time: 0.2674, Avg Time: 0.2725
Evaluating af