<a href="https://colab.research.google.com/github/ucbnlp24/hws4nlp24/blob/main/HW2/HW2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 2: PyTorch and Self-Attention

In this homework, you will begin exploring PyTorch, a neural network library that will be used throughout the remainder of the semester.  

The PDF file for instructions can be found [here](https://github.com/ucbnlp24/hws4nlp24/blob/main/HW2/HW2.pdf).

You can toggle the outline on the left-hand side to jump between sections more easily.

**Due date**: Tuesday February 13 at 11:59 PM


## Setup

In [None]:
import numpy as np
import nltk
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from collections import Counter

# Sets random seeds for reproducibility
seed=159259
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic=True

In [None]:
!python -m nltk.downloader punkt

When looking up PyTorch documentation, it may be useful to know which version of torch you are running.


In [None]:
print(torch.__version__)

## **IMPORTANT**: GPU is not enabled by default

You must switch runtime environments if the next block of code reports that it is running on `cpu`, or if a later cell fails with an error such as "ValueError: Expected a cuda device, but got: cpu".

Go to Runtime > Change runtime type > Hardware accelerator > GPU

In [None]:
device="cuda" if torch.cuda.is_available() else "cpu"
print("Running on {}".format(device))

## Deliverable 1: PyTorch and FFNN

### Data Processing

Let's begin by loading our datasets and the 50-dimensional GloVe word embeddings.  

In [None]:
!wget https://raw.githubusercontent.com/ucbnlp24/hws4nlp24/main/HW2/train.txt
!wget https://raw.githubusercontent.com/ucbnlp24/hws4nlp24/main/HW2/dev.txt
!wget https://raw.githubusercontent.com/ucbnlp24/hws4nlp24/main/HW2/glove.6B.50d.50K.txt

In [None]:
training_file, dev_file="train.txt", "dev.txt"

In [None]:
labels={'pos': 0, 'neg': 1}

In [None]:
def read_embeddings(filename, vocab_size=50000):
    """
    Utility function, loads in the `vocab_size` most common embeddings from `filename`

    Arguments:
    - filename:     path to file
                    (the embedding dimension is inferred from the first line of the file)
    - vocab_size:   maximum number of embeddings to load

    Returns:
    - embeddings:   torch.FloatTensor matrix of size (vocab_size x word_embedding_dim);
                    rows 0 and 1 are left as zeros (reserved for padding and unknown words)
    - vocab:        dictionary mapping word (str) to index (int) in embedding matrix
    """

    # get the embedding size from the first embedding
    with open(filename, encoding="utf-8") as file:
        word_embedding_dim=len(file.readline().split(" ")) - 1

    vocab={}

    embeddings=np.zeros((vocab_size, word_embedding_dim))
    with open(filename, encoding="utf-8") as file:
        for idx, line in enumerate(file):
            if idx + 2 >= vocab_size:
                break
            cols=line.rstrip().split(" ")
            val=np.array(cols[1:])
            word=cols[0]
            embeddings[idx + 2]=val
            vocab[word]=idx + 2

    # a FloatTensor is a multidimensional matrix
    # that contains 32-bit floats in every entry
    # https://pytorch.org/docs/stable/tensors.html
    return torch.FloatTensor(embeddings), vocab

def get_batches(x, y, xType, batch_size=12):
    batches_x=[]
    batches_y=[]
    for i in range(0, len(x), batch_size):
        batches_x.append(xType(x[i:i+batch_size]))
        batches_y.append(torch.LongTensor(y[i:i+batch_size]))
    return batches_x, batches_y

### Demo: Logistic regression

First, let's code up Logistic Regression in PyTorch so you can see how the general framework works.

#### Average Embedding Representation
Let's train a logistic regression classifier where the input is the average GloVe embedding for all words in a review.
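
To make the averaged representation concrete before reading the loader below, here is a minimal sketch on a toy vocabulary. The 3-dimensional vectors, the names `toy_vocab`, `toy_embs`, and the helper `average_embedding` are all hypothetical and chosen only for illustration; they are not real GloVe values or part of the assignment code.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings for a toy vocabulary (not real GloVe values)
toy_vocab = {"great": 0, "movie": 1, "boring": 2}
toy_embs = np.array([
    [1.0, 0.0, 2.0],    # great
    [0.0, 1.0, 0.0],    # movie
    [-1.0, 0.0, -2.0],  # boring
])

def average_embedding(tokens, vocab, embs):
    """Average the embeddings of all in-vocabulary tokens; zeros if none are found."""
    rows = [embs[vocab[t]] for t in tokens if t in vocab]
    if not rows:
        return np.zeros(embs.shape[1])
    return np.mean(rows, axis=0)

# "!" is not in the toy vocabulary, so only "great" and "movie" contribute
review_vec = average_embedding(["great", "movie", "!"], toy_vocab, toy_embs)
print(review_vec)  # entries 0.5, 0.5, 1.0
```

Whatever the length of the review, the result has the same dimension as a single word embedding, which is what lets a fixed-size linear layer consume it.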

In [None]:
def read_avg_glove_embeddings(filename, vocab, embs, labels):
    """
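    Utility function, loads in texts from `filename` and averages the static embeddings
    of the in-vocabulary words in each review

    Arguments:
    - filename:     path to a tab-separated file with columns (id, label, review)
    - vocab:        dictionary mapping word (str) to index (int) in embedding matrix
    - embs:         torch.FloatTensor embedding matrix, e.g. GloVe
    - labels:       dictionary mapping label (str) to class index (int)

    Returns:
    - data:         list of averaged embeddings (np.ndarray), one per review
    - data_labels:  list of class indices (int), one per review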
    """
    data, data_labels=[], []
    with open(filename) as file:
        for line in file:
            avg_emb=np.zeros(50)
            cols=line.rstrip().split("\t")
            idd=cols[0]
            label=cols[1]
            review=cols[2]
            words=nltk.word_tokenize(review)
            avg_counter=0.
            for word in words:
                word=word.lower()
                if word in vocab:
                    avg_emb += embs[vocab[word]].numpy()
                    avg_counter += 1.
            # guard against reviews with no in-vocabulary words (avoids dividing by zero)
            if avg_counter > 0:
                avg_emb /= avg_counter
            data.append(avg_emb)
            data_labels.append(labels[label])
    return data, data_labels


In [None]:
embs, glove_vocab=read_embeddings("glove.6B.50d.50K.txt")
avg_train_x, avg_train_y=read_avg_glove_embeddings(training_file, glove_vocab, embs, labels)
avg_dev_x, avg_dev_y=read_avg_glove_embeddings(dev_file, glove_vocab, embs, labels)

In [None]:
# ignore "UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow."
# We avoid Dataset etc. objects to simplify the code -- make sure you can understand what each of those lines does!
avg_trainX, avg_trainY=get_batches(avg_train_x, avg_train_y, xType=torch.FloatTensor)
avg_devX, avg_devY=get_batches(avg_dev_x, avg_dev_y, xType=torch.FloatTensor)

### Question 1: PyTorch and embeddings (writeup)

Here, a PyTorch implementation of logistic regression is offered for your reference.
Study this script carefully, starting with how we load and access the GloVe embeddings. The most critical ingredients of a neural net in PyTorch include a class that defines the architecture (pay attention to the `forward` method), an optimizer (`torch.optim.Adam`), and a loss function, `torch.nn.CrossEntropyLoss()`, which combines the log-$\mathrm{softmax}$ function `torch.nn.LogSoftmax()` and the negative log-likelihood loss `torch.nn.NLLLoss()` (see [documentation](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)).
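
The relationship between `CrossEntropyLoss`, `LogSoftmax`, and `NLLLoss` can be checked numerically. The sketch below uses made-up logits and targets (the shapes and values are assumptions for illustration only) and compares the two ways of computing the loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 2)            # 4 examples, 2 classes (unnormalized scores)
targets = torch.tensor([0, 1, 1, 0])  # made-up gold labels

# One step: cross-entropy applied directly to the logits
ce = nn.CrossEntropyLoss()(logits, targets)

# Two steps: log-softmax over the class dimension, then negative log-likelihood
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

print(ce.item(), nll.item())
print(torch.allclose(ce, nll))  # True
```

This is why the model's `forward` method can return raw linear outputs: the normalization happens inside the loss.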

**A notable difference from HW1 here is that the input of this model is the average of GloVe embeddings for all words in a movie review.
What is the difference between averaged embeddings and BoW? What are the advantages or disadvantages of each?
Discuss in no more than 50 words.**

In [None]:
# Models are usually implemented as classes in PyTorch; you need to pass nn.Module to inherit the base class
class LogisticRegressionClassifier(nn.Module):
    def __init__(self, input_dim, output_dim):
        # __init__ is what gets called when you instantiate the class
        # and here you need to pass the arguments `input_dim` and `output_dim` when you do.
        # As you can see in the next cell, a model instance is instantiated with
        #     logreg=LogisticRegressionClassifier(input_dim, output_dim)
        # and we pass the dimension of the averaged embeddings -- avg_trainX[0].shape[1]
        # and the number of labels -- len(labels) as arguments.
        super().__init__()
        # this is how you define a linear layer --
        # remember to add "self" if you want the variable to be visible to the whole model
        # (which we do)
        self.linear=torch.nn.Linear(input_dim, output_dim)
    def forward(self, input):
        # The forward() method defines how your input tensors are processed.
        # Don't change the name!
        # It's conventional to name it forward (useful distinction from backward passes), and
        # in fact you can omit the .forward call and just say
        #    logreg(x)
        # as opposed to
        #    logreg.forward(x)
        h=self.linear(input)
        return h
    def evaluate(self, x, y):
        # we use this method to evaluate model performance on dev sets
        # if this all seems overwhelming, just focus on the first two methods for now.
        # The important bits are self.eval(), which switches the model to evaluation mode,
        # and no_grad(), which turns off gradient tracking; the parameters stay fixed here
        # because we never call the optimizer during evaluation.
        # (We don't want to cheat so should update the parameters only during training.)
        self.eval()
        corr=0.
        total=0.
        with torch.no_grad():
            for x, y in zip(x, y):
                x, y=x.to(device), y.to(device)
                y_preds=self.forward(x)
                for idx, y_pred in enumerate(y_preds):
                    prediction=torch.argmax(y_pred)
                    if prediction == y[idx]:
                        corr += 1.
                    total += 1
        return corr/total

A complete training loop is here for your reference. There's nothing for you to implement here but you might want to make sure you understand the standard procedure. Remember that you **do not** have to finish training; none of your answers will depend on the model accuracy.

In [None]:
logreg=LogisticRegressionClassifier(avg_trainX[0].shape[1], len(labels)).to(device)
optimizer=torch.optim.Adam(logreg.parameters(), lr=0.001, weight_decay=1e-5)
cross_entropy=nn.CrossEntropyLoss()
losses=[]

num_labels=len(labels)

patience=10
max_dev_accuracy=0
patience_counter=0

for epoch in range(200):
    logreg.train()

    for x, y in zip(avg_trainX, avg_trainY):
        x, y = x.to(device), y.to(device)
        y_pred=logreg.forward(x)
        loss=cross_entropy(y_pred.view(-1, num_labels), y.view(-1))
        losses.append(loss.item())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dev_accuracy=logreg.evaluate(avg_devX, avg_devY)

    # check if the dev accuracy is the best seen so far
    if dev_accuracy > max_dev_accuracy:
        max_dev_accuracy=dev_accuracy
        patience_counter=0

    patience_counter+=1

    if epoch % 5 == 0:
        print("Epoch %s, dev accuracy: %.3f" % (epoch, dev_accuracy))
    if patience_counter >= patience:
        print("Stopping training; no improvement on dev data after %s epochs" % patience)
        break
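
The patience mechanism in the loop above can also be studied in isolation. The following is a minimal sketch on a made-up list of dev accuracies; the function name `epochs_until_stop` and the accuracy values are hypothetical. Note that this variant counts only epochs without improvement, which is a slightly different counting convention from the loop above (there the counter is incremented after every epoch, including improving ones).

```python
def epochs_until_stop(dev_accuracies, patience):
    """Return the 0-based epoch at which early stopping fires, or None if it never does."""
    best = 0.0
    epochs_without_improvement = 0
    for epoch, acc in enumerate(dev_accuracies):
        if acc > best:
            # new best score: remember it and reset the counter
            best = acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch
    return None

hypothetical_accs = [0.60, 0.65, 0.64, 0.66, 0.66, 0.65, 0.64]
print(epochs_until_stop(hypothetical_accs, patience=3))  # 6
```

The best score (0.66) is first reached at epoch 3; epochs 4, 5, and 6 bring no strict improvement, so the stop fires at epoch 6.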

### Question 2: FFNN (TODO)

For this question, we want to add a hidden layer to the logistic regression classifier above. Implement Eqn. (7.13) in J&M SLP3 and let the non-linearity $g$ be $\mathrm{tanh}$. Your implementation should be similar to the `LogisticRegressionClassifier` above. Let's pick $20$ for the size of the hidden layer -- this is provided in the `__init__()` function below (`hidden_dim=20`), so you don't need to worry about it.

Note that in the J&M terminology, a "two-layer" network has one hidden layer, which is what you will be implementing. You should fill in the parts between "`# BEGIN SOLUTION`" and "`# END SOLUTION`".
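
Before writing the PyTorch version, it can help to track the tensor shapes through a network with one hidden layer. The sketch below is a shape-level illustration in NumPy with random weights, not the PyTorch solution: the names `W`, `b`, `U` and the batch size of 12 are assumptions, and it uses a row-per-example layout (`x @ W`), whereas the textbook writes the same computation with column vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, input_dim, hidden_dim, output_dim = 12, 50, 20, 2

x = rng.normal(size=(batch_size, input_dim))   # a batch of averaged embeddings
W = rng.normal(size=(input_dim, hidden_dim))   # input -> hidden weights
b = np.zeros(hidden_dim)                       # hidden bias
U = rng.normal(size=(hidden_dim, output_dim))  # hidden -> output weights

h = np.tanh(x @ W + b)   # hidden layer; tanh keeps every entry in [-1, 1]
z = h @ U                # unnormalized class scores (logits), one row per example

print(h.shape, z.shape)  # (12, 20) (12, 2)
```

The softmax that turns `z` into probabilities is not applied here, for the same reason as in the logistic regression demo: `CrossEntropyLoss` takes care of it.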

In [None]:
class FFNN(nn.Module):
    # BEGIN SOLUTION
    def __init__(self, input_dim, output_dim):
        super().__init__()
        hidden_dim=20
        # complete the two lines below
        self.linear = ...
        self.fc= ...
        self.tanh=torch.nn.Tanh()

    def forward(self, input):
        raise NotImplementedError
    # END SOLUTION

    def evaluate(self, x, y):
        self.eval()
        corr=0.
        total=0.
        with torch.no_grad():
            for x, y in zip(x, y):
                x, y=x.to(device), y.to(device)
                y_preds=self.forward(x)
                for idx, y_pred in enumerate(y_preds):
                    prediction=torch.argmax(y_pred)
                    if prediction == y[idx]:
                        corr += 1.
                    total += 1
        return corr/total

In [None]:
ffnn_classifier=FFNN(avg_trainX[0].shape[1], len(labels)).to(device)
optimizer=torch.optim.Adam(ffnn_classifier.parameters(), lr=0.001, weight_decay=0)
cross_entropy=nn.CrossEntropyLoss()
losses=[]

num_labels=len(labels)

patience=30
max_dev_accuracy=0
patience_counter=0

for epoch in range(200):
    ffnn_classifier.train()

    for x, y in zip(avg_trainX, avg_trainY):
        x, y = x.to(device), y.to(device)
        y_pred=ffnn_classifier.forward(x)
        loss=cross_entropy(y_pred.view(-1, num_labels), y.view(-1))
        losses.append(loss.item())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dev_accuracy=ffnn_classifier.evaluate(avg_devX, avg_devY)

    # check if the dev accuracy is the best seen so far
    if dev_accuracy > max_dev_accuracy:
        max_dev_accuracy=dev_accuracy
        patience_counter=0

    patience_counter+=1

    if epoch % 5 == 0:
        print("Epoch %s, dev accuracy: %.3f" % (epoch, dev_accuracy))
    if patience_counter >= patience:
        print("Stopping training; no improvement on dev data after %s epochs" % patience)
        break

## Deliverable 2: Attention

The self-attention mechanism is often thought of as one of the most transformative ideas in modern NLP.
Its full form in the Transformer, as introduced in "Attention is All You Need" (NIPS 2017), is rather involved.
This deliverable aims to prepare you for it.

We will start with the simplest form of self-attention: scaled dot-product self-attention. The goal is to try to understand the roles that query, key, and value vectors play in attending to the input sequence:
conceptually, what do they aim to achieve and improve on? How do you code this in Python?


### Question 3: The Concept of Self-attention and Multi-head attention (writeup only)

#### Question 3.1
To review the concept of self-attention and multi-head attention, read $\S$10.1 of [Chapter 10](https://web.stanford.edu/~jurafsky/slp3/10.pdf) in SLP 3. This corresponds to pages 2-8 of the linked PDF. After reading this section, answer the questions below. In Question 4 you will implement self-attention and in Question 5 you will implement multi-head attention as part of a model.

> Why is the self-attention mechanism useful in NLP? What are the drawbacks of not using self-attention? Name one similarity and one difference between self-attention and multi-head attention.

**Respond in no more than 100 words. Points will be taken off for excessively long answers.**

#### Question 3.2
Fill in the blanks below based on your understanding of section "10.1.2 Multihead Attention".


> Each head in a self-attention layer is provided with its own set of ____ , ____ and __ matrices: $W^Q$, $W^K$ and $W^V$.




### Intuition of attention

First, let's go through the basics of implementing the steps in attention outside of any model.  We'll do that in NumPy.  Using the example from lecture, let's assume we have three sets of parameters $W^Q$, $W^K$ and $W^V$.

In [None]:
query_key_size=37
input_embedding_size=2

Wq=np.random.rand(input_embedding_size, query_key_size)
Wk=np.random.rand(input_embedding_size, query_key_size)
Wv=np.random.rand(input_embedding_size, input_embedding_size)

print(Wq.shape, Wk.shape, Wv.shape)

Let's also assume we have an input sentence that's 5 tokens long; each token is represented as an embedding of length `input_embedding_size` (here, $2$).  That 5-word sentence, then, is represented as a $5 \times 2$ matrix `sent`.  If we multiply `sent` by $W^Q$, the result is a $5 \times 37$ query matrix.

In [None]:
sent=np.random.rand(5, input_embedding_size)
query=sent @ Wq

print(sent.shape, query.shape)

Now let's also show how to perform the softmax operation on a matrix.  Remember that the softmax function normalizes a set of values $x = [x_1, \ldots, x_n]$ into outputs $\mathrm{softmax}(x)_i$ such that each $0 \le \mathrm{softmax}(x)_i \le 1$ and the outputs sum to 1.  Here, we have a $15 \times 5$ matrix $m$; if we perform the softmax across the columns of $m$ (`axis=1`), each row will sum to 1.

In [None]:
from scipy.special import softmax

test_matrix=np.random.rand(15,5)
print(f"**test_matrix**:\n {test_matrix}")

print('---------------------------------------------------------')

# for a 2D matrix, axis=1 normalizes across the *columns*
output=softmax(test_matrix, axis=1)
print(f"**output**:\n {output}")

print('---------------------------------------------------------')
# If we sum along the columns, each row should sum to 1
print(f"**sum of each row**:")
print(np.sum(output, axis=1))
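
For intuition about what `scipy.special.softmax` computes, the same row-wise normalization can be written by hand. This is a minimal sketch; the helper name `row_softmax` and the example matrix are made up for illustration. Subtracting the row maximum before exponentiating does not change the result but avoids overflow for large inputs.

```python
import numpy as np

def row_softmax(m):
    """Softmax over the last axis, subtracting the row max for numerical stability."""
    shifted = m - m.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

m = np.array([[1.0, 2.0, 3.0],
              [1000.0, 1000.0, 1000.0]])  # large values would overflow a naive np.exp

probs = row_softmax(m)
print(probs)
print(probs.sum(axis=1))  # each row sums to 1
```

The second row has three equal entries, so each receives probability 1/3 regardless of how large the raw values are.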


### Question 4: Scaled dot product attention in Numpy (TODO)

(TODO).  From all of this, you have the building blocks for implementing attention (outside of any model).  Do so here by filling out the attention function below.  Recall from lecture that attention, given query, key, and value matrices $Q$, $K$, and $V$, is defined by the following equation:

$$\mathrm{Attention}({Q}, {K}, {V}) = \mathrm{softmax}\left(\frac{{Q}{K}^\top}{\sqrt{d_k}}\right){V}$$

You will calculate ${Q}$, ${K}$, ${V}$ within the body of this function. The sole required argument to this function should be a 2D matrix $\in \mathbb{R}^{n~\times~ \textrm{input_embedding_size}}$ for any arbitrary $n$ (that is, corresponding to a sentence of arbitrary length).  It should return a matrix of that same exact size that is the output of that attention process over the input, given the parameters specified below. $d_k$ here is the size of the key vector (`query_key_size=37`).
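
One way to see why the equation divides by $\sqrt{d_k}$ is to look at how large raw dot products get as $d_k$ grows. The sketch below is an illustration only, not part of the solution: it assumes query and key entries drawn from a standard normal distribution (the notebook itself uses `np.random.rand`), and the sample size `num_pairs` is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 37            # matches query_key_size in this notebook
num_pairs = 10000   # arbitrary sample size for the illustration

# Assumption: entries are i.i.d. standard normal, so each dot product has variance d_k
q = rng.standard_normal((num_pairs, d_k))
k = rng.standard_normal((num_pairs, d_k))

raw_scores = np.sum(q * k, axis=1)         # one dot product per (q, k) pair
scaled_scores = raw_scores / np.sqrt(d_k)  # the scaling used in the attention equation

print("std of raw scores:   ", raw_scores.std())
print("std of scaled scores:", scaled_scores.std())
```

Under this assumption the raw scores have a standard deviation close to $\sqrt{37} \approx 6.1$, while the scaled scores stay close to 1, which keeps the softmax from saturating on a single position.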

In [None]:
from math import sqrt

query_key_size=37
input_embedding_size=2

Wq=np.random.rand(input_embedding_size, query_key_size)
Wk=np.random.rand(input_embedding_size, query_key_size)
Wv=np.random.rand(input_embedding_size, input_embedding_size)

print(Wq.shape, Wk.shape, Wv.shape)

In [None]:
def attention(input, input_embedding_size=2, query_key_size=37):
    Wq=np.random.rand(input_embedding_size, query_key_size)
    Wk=np.random.rand(input_embedding_size, query_key_size)
    Wv=np.random.rand(input_embedding_size, input_embedding_size)

    # BEGIN SOLUTION
    raise NotImplementedError
    # END SOLUTION
    assert input.shape == output.shape
    return output

### Intuition of multi-head attention

In this section, let's look at multi-head attention.

Multi-head attention is a set of self-attention layers (heads), each with its own set of parameters (Wq, Wk, Wv). The heads are parallel layers at the same depth in the model, which helps capture a variety of relations in the input sentence at once. See [SLP3 Chapter 10](https://web.stanford.edu/~jurafsky/slp3/10.pdf), pp. 9-10, and Equations 10.17, 10.18, and 10.19 for the implementation.
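
The way parallel heads are recombined can be followed at the level of shapes alone. In the sketch below the per-head outputs are random stand-ins rather than real attention outputs, and all sizes (`n`, `d_model`, `num_heads`, `d_v`) are hypothetical values chosen for illustration; only the concatenation and the output projection are shown.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, num_heads, d_v = 5, 50, 3, 16   # hypothetical sizes, chosen for illustration

# Stand-ins for the per-head attention outputs; a real head would compute these
# from its own Wq, Wk, Wv rather than drawing them at random.
head_outputs = [rng.random((n, d_v)) for _ in range(num_heads)]

concatenated = np.concatenate(head_outputs, axis=-1)  # (n, num_heads * d_v)
Wo = rng.random((num_heads * d_v, d_model))           # output projection
output = concatenated @ Wo                            # back to (n, d_model)

print(concatenated.shape, output.shape)  # (5, 48) (5, 50)
```

The output projection is what returns the concatenated heads to the input embedding size, so the layer's output has the same shape as its input.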

### Question 5: Scaled dot product attention in PyTorch (TODO)

Data prep time again:

In [None]:
def read_data_as_embeddings(filename, vocab, labels, save_texts=False):
    """
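    Utility function, loads in texts from `filename` and maps each token to its index
    in the embedding vocabulary

    Arguments:
    - filename:     path to a tab-separated file with columns (id, label, review)
    - vocab:        dictionary mapping word (str) to index (int) in embedding matrix
    - labels:       label mapping
    - save_texts:   whether to store the original texts

    Returns:
    - data:         list of token-index lists, each padded with PAD_INDEX to length 549
                    (reviews longer than 549 tokens are skipped)
    - data_labels:  list of class indices (int), one per kept review
    - texts:        list of (id, tokenized review, label) tuples, one per review in the
                    file, if `save_texts` is set; otherwise an empty list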
    """

    PAD_INDEX=0             # reserved for padding words
    UNKNOWN_INDEX=1         # reserved for unknown words
    SEP_INDEX=2

    texts=[]
    data, data_labels=[], []

    with open(filename) as f:
        for line in f:
            cols=line.split("\t")
            idd=cols[0]
            label=cols[1]
            review=cols[2]
            tokenized_review=nltk.word_tokenize(review.lower())
            w_int=[]
            for w in tokenized_review:
                if w in vocab:
                    w_int.append(vocab[w])
                else:
                    w_int.append(UNKNOWN_INDEX)
            if len(w_int) < 549:
                w_int.extend([PAD_INDEX] * (549 - len(w_int)))
            if len(w_int) < 550:
                data.append((w_int))
                data_labels.append(labels[label])
            if save_texts:
                texts.append((idd, tokenized_review, label))
    return data, data_labels, texts

attn_embeddings=nn.Embedding.from_pretrained(embs)
attn_train_x, attn_train_y, attn_train_texts=read_data_as_embeddings(training_file, glove_vocab, labels, True)
attn_trainX, attn_trainY=get_batches(attn_train_x, attn_train_y, torch.LongTensor, 1)
attn_dev_x, attn_dev_y, _=read_data_as_embeddings(dev_file, glove_vocab, labels)
attn_devX, attn_devY=get_batches(attn_dev_x, attn_dev_y, torch.LongTensor, 1)
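
The index mapping performed by the data prep code above can be tried on a toy example. This is a minimal sketch: the helper `tokens_to_padded_ids` and the vocabulary `toy_vocab` are hypothetical, and unlike the loader above it does not drop over-long reviews (a sequence longer than `max_len` is returned unpadded).

```python
PAD_INDEX, UNKNOWN_INDEX = 0, 1   # same reserved indices as in the data prep code

def tokens_to_padded_ids(tokens, vocab, max_len):
    """Map tokens to indices (unknown words -> UNKNOWN_INDEX) and right-pad to max_len."""
    ids = [vocab.get(t, UNKNOWN_INDEX) for t in tokens]
    return ids + [PAD_INDEX] * (max_len - len(ids))

toy_vocab = {"the": 2, "movie": 3, "was": 4}   # hypothetical vocabulary
padded = tokens_to_padded_ids(["the", "movie", "was", "meh"], toy_vocab, max_len=6)
print(padded)  # [2, 3, 4, 1, 0, 0]
```

"meh" is not in the toy vocabulary, so it maps to index 1, and the two trailing zeros are padding. Because every review ends up with the same length, reviews can be stacked into tensors of a fixed shape.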

Now it's time to implement that as part of a model.  Here we're going to embed attention within a larger model.  For an input document of, say, $20$ words, each represented by a $100$-dimensional embedding, the input to attention is a $20 \times 100$ matrix; the output from attention is also a $20 \times 100$ matrix.  In this larger model, we're going to average those output embeddings to generate a final document representation that's a single $100$-dimensional vector, and then pass it through a fully-connected dense layer to make a prediction.

As an extension of an implementation similar to the ```attention``` function, you'll have to concatenate the outputs from each head and apply a transformation with the ```Wo``` matrix (a learnable parameter, just like Wq, Wk, and Wv). Implement the ```__init__()``` and ```forward()``` methods in the ```MultiHeadAttention``` class. Again, Equations 10.17, 10.18, and 10.19 of [SLP3 Chapter 10](https://web.stanford.edu/~jurafsky/slp3/10.pdf) are what we expect you to implement in the ```forward()``` method. [Lecture 5](https://ucbnlp24.github.io/webpage/slides/5_attention.pdf) slides #36 - #40 are useful to look into.

Test yourself before proceeding (not graded, but always remind yourself of this kind of thing): In our GloVe representation of movie review data, each padded review has ? tokens, each represented by a ?-dimensional GloVe embedding.

In [None]:
class MultiHeadAttention(nn.Module):
    # BEGIN SOLUTION
    raise NotImplementedError
    # END SOLUTION

class MultiHeadAttentionClassifier(nn.Module):
    def __init__(self, params, pretrained_embeddings):
        super().__init__()

        self.seq_len=params["max_seq_len"]
        self.num_labels=params["label_length"]
        self.query_key_size=params["query_key_size"]
        self.num_heads=params["num_heads"]

        self.embeddings=nn.Embedding.from_pretrained(pretrained_embeddings)
        self.input_embedding_size=self.embeddings.weight.data.shape[1]
        self.attention=MultiHeadAttention(self.input_embedding_size, self.query_key_size, self.num_heads)
        self.softmax=nn.Softmax(dim=1)
        self.fc=nn.Linear(self.seq_len, params["label_length"])


    def forward(self, input):
        x=self.embeddings(input)
        x=self.attention(x)
        x=x.mean(1)
        x=self.fc(x)
        return x.squeeze()

    def evaluate(self, x, y):
        self.eval()
        corr=0.
        total=0.
        with torch.no_grad():
            for x, y in zip(x, y):
                x, y=x.to(device), y.to(device)
                x=x[0]
                y_pred=self.forward(x)
                prediction=torch.argmax(y_pred)
                if prediction == y:
                    corr += 1.
                total+=1
        return corr/total


In [None]:
attnmodel = MultiHeadAttentionClassifier(
    params={
        "max_seq_len": 549,
        "label_length": len(labels),
        "query_key_size": 64,
        "num_heads": 3
    },
    pretrained_embeddings=embs
).to(device)

optimizer=torch.optim.Adam(attnmodel.parameters(), lr=0.001, weight_decay=1e-5)
cross_entropy=nn.CrossEntropyLoss()
losses=[]

num_epochs=15
best_dev_acc = 0.
patience=10
patience_counter=0

Again, you can execute the code to self-check, but you don't need to finish training.

In [None]:
for epoch in range(num_epochs):
    attnmodel.train()

    for x, y in zip(attn_trainX, attn_trainY):
        x=x[0]
        x, y=x.to(device), y.to(device)
        y_pred=attnmodel.forward(x)
        loss=cross_entropy(y_pred.view(-1, attnmodel.num_labels), y.view(-1))

        losses.append(loss.item())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dev_accuracy=attnmodel.evaluate(attn_devX, attn_devY)

    # check if the dev accuracy is the best seen so far; save the model if so
    print("Epoch %s, dev accuracy: %.3f" % (epoch, dev_accuracy))
    if dev_accuracy > best_dev_acc:
        torch.save(attnmodel.state_dict(), 'best-attnmodel-parameters.pt')
        best_dev_acc = dev_accuracy
        patience_counter=0

    patience_counter+=1
    if patience_counter >= patience:
        print("Stopping training; no improvement on dev data after %s epochs" % patience)
        break

attnmodel.load_state_dict(torch.load('best-attnmodel-parameters.pt'))
print("\nBest Performing Model achieves dev accuracy of : %.3f" % (best_dev_acc))

---

Congrats! You're officially done with this homework -- be sure to check the "How to submit" section in the PDF.