### Word Embeddings

Consider the limited vocabulary below:

In [1]:
vocab = ["the", "quick", "brown", "sly", "fox", "jumped", "over", "a", "lazy", "dog", "and","found","lion"]
print(len(vocab))

13


Write a function to create one-hot encodings of the words, mapping each word to a vector whose entry at the word's position in the vocab list is 1 and whose other entries are all 0. For example, "quick" should map to a one-dimensional torch tensor with entries 0, 1, 0, .... Create an extra category for words not in the vocabulary.

In [6]:
import torch

def one_hot_embedding(token, vocab):
  """
  Token should be a list of W words or an individual word (treated as W = 1).
  The output is a torch tensor of size W x (V+1) which gives the one-hot encoding for all W tokens.
  """
  # Wrap a single word in a list so that we do not iterate over its characters
  if isinstance(token, str):
      token = [token]

  vocab_size = len(vocab)
  one_hot_matrix = torch.zeros(len(token), vocab_size + 1)

  for i, word in enumerate(token):
      if word in vocab:
          one_hot_matrix[i, vocab.index(word)] = 1
      else:
          # extra category for words not in the vocabulary
          one_hot_matrix[i, -1] = 1

  return one_hot_matrix
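
As a cross-check on the encoding described above, here is a minimal alternative sketch that uses a dictionary lookup and `torch.nn.functional.one_hot`. The names `word_to_idx`, `unk_idx`, and `one_hot_lookup`, and the out-of-vocabulary word "zebra", are illustrative assumptions and not part of the exercise.

```python
import torch
import torch.nn.functional as F

vocab = ["the", "quick", "brown", "sly", "fox", "jumped", "over",
         "a", "lazy", "dog", "and", "found", "lion"]

# Hypothetical helper table: one dictionary lookup per word instead of list.index
word_to_idx = {word: i for i, word in enumerate(vocab)}
unk_idx = len(vocab)  # index of the extra out-of-vocabulary category

def one_hot_lookup(tokens):
    """Return a W x (V+1) one-hot tensor for a list of W words."""
    indices = torch.tensor([word_to_idx.get(w, unk_idx) for w in tokens])
    return F.one_hot(indices, num_classes=len(vocab) + 1).float()

encoded = one_hot_lookup(["quick", "zebra"])
print(encoded.shape)                   # torch.Size([2, 14])
print(encoded.argmax(dim=1).tolist())  # [1, 13]
```

"quick" lands at index 1, and the unknown word falls into the last (14th) column.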

Create an nn.Module that takes in a single sentence, encodes each word as a $D$-dimensional embedding, and averages the embeddings. The input should be a Python list of words and the output a torch vector of size $D$.

In [11]:
import torch.nn as nn

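class MyWordEmbeddingBag(nn.Module):
    def __init__(self, vocab, dim):
        super(MyWordEmbeddingBag, self).__init__()
        # Keep the vocabulary on the module instead of relying on a global
        self.vocab = vocab
        # One extra row for out-of-vocabulary words; mode="mean" averages each bag
        self.embedding = nn.EmbeddingBag(len(vocab) + 1, dim, mode="mean", sparse=True)

    def forward(self, input_list):
        # Recover integer word indices from the one-hot rows
        indices = torch.argmax(one_hot_embedding(input_list, self.vocab), dim=1)
        # A single offset of 0 treats the whole sentence as one bag
        offsets = torch.zeros(1, dtype=torch.long)
        # Output has shape (1, dim): the mean embedding of the sentence
        output = self.embedding(indices, offsets)

        avg_embedding = output.squeeze(0)

        return avg_embedding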
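
The averaging that `nn.EmbeddingBag` performs in "mean" mode can be checked against an explicit `nn.Embedding` lookup followed by a mean. The sketch below assumes a table of 14 rows (13 vocabulary words plus one out-of-vocabulary slot) and arbitrary word indices; the variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_embeddings, dim = 14, 100  # 13 vocabulary words + 1 out-of-vocabulary slot

bag = nn.EmbeddingBag(num_embeddings, dim, mode="mean")
table = nn.Embedding(num_embeddings, dim)

# Copy the weights so that both modules use the same embedding table
with torch.no_grad():
    table.weight.copy_(bag.weight)

indices = torch.tensor([0, 1, 2])  # e.g. a three-word sentence
offsets = torch.tensor([0])        # one bag starting at position 0

bag_out = bag(indices, offsets).squeeze(0)  # mean computed by EmbeddingBag
manual_out = table(indices).mean(dim=0)     # explicit lookup, then average

print(bag_out.shape)  # torch.Size([100])
print(torch.allclose(bag_out, manual_out, atol=1e-6))
```

With a single offset, the whole index list forms one bag, so the output already is the sentence average and no further mean is needed.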


Instantiate the model with vectors of size 100 and forward pass the following sentences through your module.

In [12]:
sent1 = ["the", "quick", "brown"]
sent2 = ["the", "sly", "fox", "jumped"]
sent3 = ["the", "dog", "found","a","lion"]

#Instantiate model
embedding_dim = 100
my_model = MyWordEmbeddingBag(vocab, embedding_dim)

#forward pass sentences
assert(len(my_model(sent1))==100)
assert(len(my_model(sent2))==100)
assert(len(my_model(sent3))==100)


Compute the Euclidean distance between "fox" and "dog" using the randomly initialized embedding table in your model above. Because the table is randomly initialized, the distance is also random here; in a trained model, related words often end up closer together, depending on the training objective.

In [14]:
# Compute the euclidean distance between "fox" and "dog"
fox = vocab.index("fox")
dog  = vocab.index("dog")

# extract the weights from the model
embedding_weights = my_model.embedding.weight.detach()

# get the embeddings for "fox" and "dog"
fox_embedding = embedding_weights[fox]
dog_embedding = embedding_weights[dog]

# compute Euclidean distance
euclidean_distance = torch.dist(fox_embedding, dog_embedding)

print(f"Euclidean distance between 'fox' and 'dog': {euclidean_distance.item():.2f}")


Euclidean distance between 'fox' and 'dog': 14.64
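
For reference, `torch.dist` with its default `p=2` computes $\sqrt{\sum_i (a_i - b_i)^2}$. A small sketch on two hand-picked vectors (chosen here only for illustration) shows the built-in call agreeing with the formula:

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 6.0, 3.0])

# Explicit formula: square root of the sum of squared differences
manual = torch.sqrt(((a - b) ** 2).sum())

# Built-in p-norm distance (p=2 by default)
builtin = torch.dist(a, b)

print(manual.item(), builtin.item())  # 5.0 5.0
```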


### Recurrent Neural Networks

We will experiment with recurrent networks using the MNIST dataset.

In [15]:
import torchvision
import torch
import torchvision.transforms as transforms

from torch.utils.data import Subset

### Hotfix for very recent MNIST download issue https://github.com/pytorch/vision/issues/1938
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

dataset = torchvision.datasets.MNIST('./', download=True, transform=transforms.Compose([transforms.ToTensor()]), train=True)
train_indices = torch.arange(0, 10000)
train_dataset = Subset(dataset, train_indices)

dataset=torchvision.datasets.MNIST('./', download=True, transform=transforms.Compose([transforms.ToTensor()]), train=False)
test_indices = torch.arange(0, 10000)
test_dataset = Subset(dataset, test_indices)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./MNIST/raw/train-images-idx3-ubyte.gz


100%|██████████| 9912422/9912422 [00:00<00:00, 335158255.23it/s]


Extracting ./MNIST/raw/train-images-idx3-ubyte.gz to ./MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./MNIST/raw/train-labels-idx1-ubyte.gz


100%|██████████| 28881/28881 [00:00<00:00, 39586828.05it/s]


Extracting ./MNIST/raw/train-labels-idx1-ubyte.gz to ./MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./MNIST/raw/t10k-images-idx3-ubyte.gz


100%|██████████| 1648877/1648877 [00:00<00:00, 29839458.93it/s]

Extracting ./MNIST/raw/t10k-images-idx3-ubyte.gz to ./MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./MNIST/raw/t10k-labels-idx1-ubyte.gz


100%|██████████| 4542/4542 [00:00<00:00, 6743550.01it/s]

Extracting ./MNIST/raw/t10k-labels-idx1-ubyte.gz to ./MNIST/raw

In [16]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64,
                                          shuffle=True, num_workers=0)

test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=16,
                                          shuffle=False, num_workers=0)

Consider the following script (modified from https://github.com/yunjey/pytorch-tutorial/blob/master/tutorials/02-intermediate/recurrent_neural_network/main.py), which trains an RNN on the MNIST data. Here we treat each row of the image as the input at one step of the RNN: the reshape to (batch, 28, 28) with batch_first=True makes the row index the time dimension. After 28 steps the model applies a linear layer followed by a cross-entropy loss. We will use this to familiarize ourselves with the nn.RNN module and the nn.LSTM module. First run the training cell below.
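
As a quick aside before the training cell, the shapes involved can be checked on a random stand-in batch. This is a sketch under the assumption of a (64, 1, 28, 28) batch, matching the training loader's batch size; no MNIST data is used.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Random stand-in for one batch of images: (batch, channels, height, width)
images = torch.rand(64, 1, 28, 28)

# Drop the channel dimension: each image becomes 28 time steps of 28 features
sequence = images.reshape(-1, 28, 28)

# Time step t of a sequence is row t of the original image
row_matches = torch.equal(sequence[0, 3], images[0, 0, 3])

rnn = nn.RNN(input_size=28, hidden_size=128, num_layers=2, batch_first=True)
out, h_n = rnn(sequence)  # the initial hidden state defaults to zeros

print(out.shape)  # torch.Size([64, 28, 128]): top-layer hidden state at every step
print(h_n.shape)  # torch.Size([2, 64, 128]): final hidden state of each layer

# The last time step of `out` is the final hidden state of the top layer
last_step_matches = torch.allclose(out[:, -1, :], h_n[-1])
print(row_matches, last_step_matches)
```

This is why the classifier can read `out[:, -1, :]`: it is the top layer's summary of all 28 rows.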



In [18]:
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms


# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyper-parameters
sequence_length = 28
input_size = 28
hidden_size = 128
num_layers = 2
num_classes = 10
batch_size = 100  # not used below: train_loader was built above with batch_size=64
num_epochs = 2
learning_rate = 0.01


# Recurrent neural network (many-to-one)
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # Set initial hidden state (nn.RNN has no cell state)
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device)

        # Forward propagate RNN
        out , _ = self.rnn(x, h0)  # out: tensor of shape (batch_size, seq_length, hidden_size)

        # Decode the hidden state of the last time step
        out = self.fc(out[:, -1, :])
        return out

model = RNN(input_size, hidden_size, num_layers, num_classes).to(device)


# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.reshape(-1, sequence_length, input_size).to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()

        #Gradient clipping
        #torch.nn.utils.clip_grad_norm_(model.parameters(), 0.2)

        # Print the gradient norm for model.rnn.weight_ih_l0 after the first minibatch of each epoch
        if i == 0:
            weight_ih_grad_norm = torch.norm(model.rnn.weight_ih_l0.grad)
            print(f'gradient norm of model.rnn.weight_ih_l0 after first minibatch: {weight_ih_grad_norm.item()}')

        optimizer.step()

        if (i+1) % 10 == 0:
            print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                   .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

# Test the model
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.reshape(-1, sequence_length, input_size).to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Test Accuracy of the model on the 10000 test images: {} %'.format(100 * correct / total))

gradient norm of model.rnn.weight_ih_l0 after first minibatch: 0.04013664647936821
Epoch [1/2], Step [10/157], Loss: 2.3971
Epoch [1/2], Step [20/157], Loss: 2.4714
Epoch [1/2], Step [30/157], Loss: 2.3240
Epoch [1/2], Step [40/157], Loss: 2.3662
Epoch [1/2], Step [50/157], Loss: 2.3167
Epoch [1/2], Step [60/157], Loss: 2.3435
Epoch [1/2], Step [70/157], Loss: 2.3859
Epoch [1/2], Step [80/157], Loss: 2.4344
Epoch [1/2], Step [90/157], Loss: 2.3919
Epoch [1/2], Step [100/157], Loss: 2.4995
Epoch [1/2], Step [110/157], Loss: 2.4760
Epoch [1/2], Step [120/157], Loss: 2.3804
Epoch [1/2], Step [130/157], Loss: 2.4192
Epoch [1/2], Step [140/157], Loss: 2.3308
Epoch [1/2], Step [150/157], Loss: 2.4480
gradient norm of model.rnn.weight_ih_l0 after first minibatch: 2.87736675090855e-05
Epoch [2/2], Step [10/157], Loss: 2.2943
Epoch [2/2], Step [20/157], Loss: 2.3986
Epoch [2/2], Step [30/157], Loss: 2.4726
Epoch [2/2], Step [40/157], Loss: 2.3181
Epoch [2/2], Step [50/157], Loss: 2.3061
Epoch [

Modify the above code (no need to create a new cell) to print the gradient norm of a parameter after the backward pass on the first minibatch. Do this for the following weight parameter: model.rnn.weight_ih_l0. (Implemented in the cell above.)
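
The same inspection extends to every recurrent parameter, and the commented-out clipping call can be exercised alongside it. The sketch below uses a small stand-in model and random data (assumptions, not the MNIST setup), and the names `grad_norms`, `norm_before`, and `norm_after` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model and random data (assumptions, not the MNIST setup above)
rnn = nn.RNN(input_size=28, hidden_size=128, num_layers=2, batch_first=True)
fc = nn.Linear(128, 10)
x = torch.randn(8, 28, 28)
y = torch.randint(0, 10, (8,))

out, _ = rnn(x)
loss = nn.CrossEntropyLoss()(fc(out[:, -1, :]), y)
loss.backward()

# Gradient norm of every recurrent parameter after the backward pass
grad_norms = {name: p.grad.norm().item() for name, p in rnn.named_parameters()}
for name, value in grad_norms.items():
    print(f"{name}: {value:.6f}")

# clip_grad_norm_ rescales the gradients in place and returns the total norm
# measured before clipping
max_norm = 0.2
params = list(rnn.parameters()) + list(fc.parameters())
norm_before = torch.nn.utils.clip_grad_norm_(params, max_norm)
norm_after = torch.sqrt(sum(p.grad.norm() ** 2 for p in params))
print(f"total norm before: {norm_before.item():.4f}, after: {norm_after.item():.4f}")
```

After the call, the total gradient norm is at most `max_norm`; gradients that were already below the threshold are left unchanged.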

Modify the code (in a new cell below) to use an LSTM (and remove the gradient clipping), then rerun it. This is essentially what the original script linked above does, so you may check it for reference or if you get stuck. Compare the accuracy and the gradient norm of weight_ih_l0 with those of the plain RNN.
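
One point to keep in mind for that comparison: in `nn.LSTM`, `weight_ih_l0` stacks the input weights of all four gates, so it has four times as many rows as the corresponding `nn.RNN` matrix, and the two norms are taken over tensors of different sizes. The sketch below checks these shapes with the hyper-parameters used in this notebook, on a random input chosen for illustration.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=28, hidden_size=128, num_layers=2, batch_first=True)
lstm = nn.LSTM(input_size=28, hidden_size=128, num_layers=2, batch_first=True)

# The LSTM matrix holds the input, forget, cell, and output gate weights stacked row-wise
print(tuple(rnn.weight_ih_l0.shape))   # (128, 28)
print(tuple(lstm.weight_ih_l0.shape))  # (512, 28)

# The LSTM returns a hidden state and a cell state; both default to zeros at the start
x = torch.randn(4, 28, 28)
out, (h_n, c_n) = lstm(x)
print(tuple(out.shape), tuple(h_n.shape), tuple(c_n.shape))
# (4, 28, 128) (2, 4, 128) (2, 4, 128)
```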

In [19]:
# LSTM
class LSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(LSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device)

        out, _ = self.lstm(x, (h0, c0))
        out = self.fc(out[:, -1, :])

        return out

model_lstm = LSTM(input_size, hidden_size, num_layers, num_classes).to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer_lstm = torch.optim.Adam(model_lstm.parameters(), lr=learning_rate)

# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.reshape(-1, sequence_length, input_size).to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model_lstm(images)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer_lstm.zero_grad()
        loss.backward()

        # Print the gradient norm for model_lstm.lstm.weight_ih_l0 after the first minibatch of each epoch
        if i == 0:
            weight_ih_l0_grad_norm = torch.norm(model_lstm.lstm.weight_ih_l0.grad)
            print(f'Gradient Norm of model_lstm.lstm.weight_ih_l0 after first minibatch: {weight_ih_l0_grad_norm.item()}')

        optimizer_lstm.step()

        if (i+1) % 10 == 0:
            print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                   .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

# Test the model
model_lstm.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.reshape(-1, sequence_length, input_size).to(device)
        labels = labels.to(device)
        outputs = model_lstm(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Test Accuracy of the LSTM model on the 10000 test images: {} %'.format(100 * correct / total))

Gradient Norm of model_lstm.lstm.weight_ih_l0 after first minibatch: 0.007739682216197252
Epoch [1/2], Step [10/157], Loss: 2.2484
Epoch [1/2], Step [20/157], Loss: 1.9376
Epoch [1/2], Step [30/157], Loss: 1.2920
Epoch [1/2], Step [40/157], Loss: 1.4531
Epoch [1/2], Step [50/157], Loss: 0.7231
Epoch [1/2], Step [60/157], Loss: 0.9452
Epoch [1/2], Step [70/157], Loss: 0.7732
Epoch [1/2], Step [80/157], Loss: 0.6432
Epoch [1/2], Step [90/157], Loss: 0.8362
Epoch [1/2], Step [100/157], Loss: 0.5037
Epoch [1/2], Step [110/157], Loss: 0.6440
Epoch [1/2], Step [120/157], Loss: 0.4464
Epoch [1/2], Step [130/157], Loss: 0.5817
Epoch [1/2], Step [140/157], Loss: 0.2926
Epoch [1/2], Step [150/157], Loss: 0.5951
Gradient Norm of model_lstm.lstm.weight_ih_l0 after first minibatch: 0.3933633863925934
Epoch [2/2], Step [10/157], Loss: 0.3511
Epoch [2/2], Step [20/157], Loss: 0.3241
Epoch [2/2], Step [30/157], Loss: 0.3752
Epoch [2/2], Step [40/157], Loss: 0.4754
Epoch [2/2], Step [50/157], Loss: 0.3