# Homework 2: Language Models and Neural Networks
#### CSCI 3832 Natural Language Processing

*Your name and email here*

In this homework we'll take the bigram language model you implemented in class and extend it to trigrams.

Instead of looking at the Bible, we'll revisit the sentiment analysis problem from the previous homework. This dataset contains a split of unlabeled movie reviews. We'll train our language model on this unlabeled split (i.e. we'll pretrain our language model) and then use this model as a starting point for a neural classification model (i.e. finetuning), which we'll use for sentiment classification. Finally, we'll replace our trained embeddings with GloVe pretrained vectors to see if we get any improvement.

## Section 1: Neural Language Modeling

We'll first load the unsupervised data. Set `data_dir` below to the directory you used for Homework 1, to avoid copying the data twice.

In [2]:
conda install pytorch


In [4]:
import os, random, sys, matplotlib.pyplot as plt
import torch, torch.nn as nn, numpy
from nltk.tokenize import word_tokenize

In [5]:
data_dir = '/Users/a511/Desktop/cu23 spring/3832/aclImdb/'
data_limit = 15000

def read_folder(folder):
    examples = []
    for fname in os.listdir(folder)[:data_limit]:
        with open(os.path.join(folder, fname), encoding='utf8') as f:
            examples.append(f.readline().strip())
    return examples

unsup_examples = read_folder(os.path.join(data_dir, 'train/unsup/'))
print(unsup_examples[0])

A newspaperman (Johnny Twennies) living in the 90's with a complete 20's personality and lifestyle - fedora, manual typewriter, the Charleston, the works. It's a great idea for a movie and it couldn't have been done better.<br /><br />Johnny doesn't miss a cliche, but never uses the same one twice. You'll find yourself anticipating his reactions to the harsher '90s world as the movie goes along, you'll often guess right - but that makes the movie just that much more fun.<br /><br />Lots of fun when Johnny is called on to save the same damsel in distress (named Virginia, natch) on three different occasions. She responds with appropriate fluttering eyelids each time.<br /><br />His reaction to independent women, openly gay men, and the general '90s milieu is delightful. He remains happily oblivious.<br /><br />Don't worry, the movie never takes itself seriously. Nobody preaches about the evil of the present, or the shallowness of the past. You end up with a warm feeling for all the chara

The dataset also comes with a pre-made vocabulary, which we'll rely on for this section of the homework. We'll eventually convert our words to indices, so let's store the words in a dictionary, mapping each to a unique integer.

In [6]:
vocabulary_file = os.path.join(data_dir, 'imdb.vocab')

raw_vocabulary = []
with open(vocabulary_file, 'r', encoding='utf8') as f:
    for line in f:
        raw_vocabulary.append(line.strip())

#Limit our vocabulary size to top 5k words
raw_vocabulary = raw_vocabulary[:5000]

# Add in our special tokens
special_tokens = ['<s>', '</s>', '<unk>']

vocabulary = {}

'''

Your code here.

Create the vocabulary dictionary by prepending the special tokens to the raw vocabulary, and enumerating them.

10 pts.

'''
vocabulary_list = special_tokens + raw_vocabulary
vocabulary = {word: index for index, word in enumerate(vocabulary_list)}


In [7]:
assert isinstance(vocabulary, dict)
assert len(vocabulary) == 5003
assert vocabulary['<s>'] == 0
assert vocabulary['significance'] == 5002


In [11]:
print(vocabulary)
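
When inspecting model inputs or predictions, it helps to map ids back to words. Here is a minimal sketch, assuming a toy word list in place of `imdb.vocab` (`toy_words` and `id_to_word` are illustrative names, not part of the assignment):

```python
# Toy stand-in for the 5,000 words read from imdb.vocab (assumption for illustration)
special_tokens = ['<s>', '</s>', '<unk>']
toy_words = ['the', 'and', 'a']

# Special tokens first, then enumerate to assign each word a unique integer
vocabulary = {word: index for index, word in enumerate(special_tokens + toy_words)}

# Inverse mapping, useful for decoding ids back into readable text
id_to_word = {index: word for word, index in vocabulary.items()}

decoded = [id_to_word[i] for i in [0, 3, 5, 1]]
print(vocabulary)
print(decoded)  # ['<s>', 'the', 'a', '</s>']
```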



Now that we have a vocabulary, we can process the unsupervised examples we loaded earlier into actual training data our model can read.

First, we'll tokenize the text normally:

In [12]:
"""
This block may take a while (<5 minutes) to run, but you only have to run it once, so make sure you don't modify the tokenized_examples list after it's completed.
While you're writing your code, consider limiting unsup_examples to the first 5 examples as a smoke test before you run the loop over all examples
"""

from nltk.tokenize import word_tokenize

tokenized_examples = []
sos_id = vocabulary['<s>'] #start of sequence
eos_id = vocabulary['</s>'] #end of sequence
unk_id = vocabulary['<unk>']

for example in unsup_examples:
    example_tokens = [token.lower() for token in word_tokenize(example)]

    token_ids = [sos_id]
    for token in example_tokens:
        '''
            Your code here.

            The above loop iterates over the tokens in a single example. If a token is in our vocabulary, then add it to token_ids. If not, add the unknown token.

            10 pts.
            

        '''


    token_ids.append(eos_id)
    tokenized_examples.append(token_ids)
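
The lookup-with-fallback step can be tried on a toy example first. This is only a sketch under stated assumptions: `tokens_to_ids` and `toy_vocabulary` are hypothetical names, and `str.split` stands in for `word_tokenize` so the snippet has no dependencies.

```python
def tokens_to_ids(tokens, vocabulary, unk_token='<unk>'):
    """Map each token to its id, falling back to the unknown-token id."""
    unk_id = vocabulary[unk_token]
    return [vocabulary.get(token, unk_id) for token in tokens]

toy_vocabulary = {'<s>': 0, '</s>': 1, '<unk>': 2, 'a': 3, 'great': 4, 'movie': 5}
tokens = 'a great zombie movie'.split()   # str.split stands in for word_tokenize here

# Wrap the sequence in start and end markers, as the loop above does
token_ids = [toy_vocabulary['<s>']] + tokens_to_ids(tokens, toy_vocabulary) + [toy_vocabulary['</s>']]
print(token_ids)  # [0, 3, 4, 2, 5, 1]
```

The word "zombie" is not in the toy vocabulary, so it maps to id 2, the unknown token.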

In [14]:
assert len(tokenized_examples[0]) == 191
assert tokenized_examples[0] == [0, 11, 940, 2, 3, 86, 2091, 6, 107, 618, 158, 134, 2, 25, 43, 22, 16, 71, 2, 6, 3, 2587, 42, 38, 2, 653, 2, 2, 11, 27, 2361, 2, 29, 11, 440, 2, 3, 2260, 2, 2, 4, 108, 70, 55, 51, 2, 3, 229, 4300, 4, 522, 2648, 2, 13, 1517, 3053, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 38, 11, 14, 2, 44, 12, 29, 2, 4, 468, 8, 388, 854, 7, 1103, 2, 2, 2, 345, 2, 2738, 294, 2, 11, 120, 39, 3, 471, 154, 2, 19, 201, 115, 6, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 29, 1731, 695, 2, 387, 21, 2, 3, 473, 25, 410, 7, 77, 2, 2, 2, 4, 16, 44, 291, 228, 2, 1300, 1796, 2, 84, 50, 2, 2, 10, 3, 943, 2, 367, 291, 2587, 334, 1320, 33, 2, 31, 1267, 2, 444, 4, 323, 2, 2, 49, 2, 2, 4, 16, 90, 2, 66, 2, 2, 173, 42, 377, 2, 2, 25, 598, 2, 14, 12, 34, 387, 2, 6, 2, 2, 6, 1951, 2, 49, 1]


Now we can create our bigram data. We'll make use of the torch Dataset class. We only need to implement the `__getitem__` and `__len__` methods to make this work with other existing torch tools.

For this dataset, for each example, iterate over its bigrams. If either one of the tokens is an unknown token, then do not save the bigram. Since we're using a small vocabulary, we'll have a lot of unknowns, and we don't want our model to always predict this token as the most likely next token.

Note that with a normal sized vocabulary, training set, and model, you wouldn't necessarily want to do this -- unknowns would hopefully be relatively rare.

In [None]:
import torch

class BigramDataset(torch.utils.data.Dataset):

    def __init__(self, tokenized_data):

        self.examples = []
        for example in tokenized_data:              #Iterate over our dataset
            for i in range(0,len(example) - 1):     #Iterate over the tokens of the example
                '''
                    Your code here.

                    Bigrams should be a tuple of integers: (example[i], example[i+1])
                    For each bigram, if either example[i] or example[i+1] is the unknown token, do not add the bigram to our examples.

                    10 pts.
                '''

    def __getitem__(self, idx):

        return self.examples[idx]

    def __len__(self):

        return len(self.examples)
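
The filtering rule is easy to check on a single token sequence before wiring it into the dataset class. A minimal sketch, where `extract_bigrams` is a hypothetical helper and the ids are made up:

```python
def extract_bigrams(token_ids, unk_id):
    """Return (current, next) pairs, skipping any pair that contains unk_id."""
    bigrams = []
    for current_id, next_id in zip(token_ids, token_ids[1:]):
        if current_id != unk_id and next_id != unk_id:
            bigrams.append((current_id, next_id))
    return bigrams

print(extract_bigrams([0, 3, 4, 2, 5, 1], unk_id=2))  # [(0, 3), (3, 4), (5, 1)]
```

A single unknown token removes two bigrams: the one that ends in it and the one that starts with it.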




Now we'll define the bigram model. This is similar to the one in class: the input is a single token, and the model outputs a score (logit) for every word in the vocabulary, which a softmax turns into a probability distribution over the next token.

In [None]:
import torch.nn as nn

class BigramLM(nn.Module):

    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_hidden_layers):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.hidden_layer_1 = nn.Linear(embedding_dim, hidden_dim)
        self.hidden_layers = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_hidden_layers - 1)]
        )
        self.output_layer = nn.Linear(hidden_dim, vocab_size)

        self.relu = nn.ReLU()

    def forward(self, input):

        embedding = self.embedding(input)

        hidden = self.relu(self.hidden_layer_1(embedding))

        for layer in self.hidden_layers:
            hidden = self.relu(layer(hidden))

        output = self.output_layer(hidden)

        return output

Now we'll train the model. This training loop is similar to the one shown in lecture, with a couple of differences.

In [None]:
# Training Loop

#Initialize our model -- keep it small with 1 hidden layer, and embedding sizes of 50
bigram_model = BigramLM(len(vocabulary), 50, 50, 1)

#Initialize our dataset using a subset of examples
bigram_dataset = BigramDataset(tokenized_examples[:5000])

criteria = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(bigram_model.parameters())            #AdamW is a popularly used optimizer
# optimizer = torch.optim.SGD(bigram_model.parameters(), lr=0.5)    #Either of these optimizers could be used

softmax = nn.Softmax(dim=2)

epochs = 3
batch_size = 32
print_frequency = 1000

#We'll create an instance of a torch dataloader to collate our data. This class handles batching and shuffling (should be done each epoch)
train_dataloader = torch.utils.data.DataLoader(bigram_dataset, batch_size=batch_size, shuffle=True)

for i in range(epochs):
    print('### Epoch: ' + str(i+1) + ' ###')

    bigram_model.train()
    avg_loss = 0

    for step, data in enumerate(train_dataloader):

        x, y = data
        x = x.unsqueeze(1)


        optimizer.zero_grad()

        model_output = bigram_model(x)

        # nn.CrossEntropyLoss applies log-softmax internally, so it is given the raw logits
        loss = criteria(model_output.squeeze(1), y)

        loss.backward()
        optimizer.step()

        avg_loss += loss.item()
        if (step + 1) % print_frequency == 0:   # a full window of print_frequency losses has accumulated
            print('epoch: {} batch: {} loss: {}'.format(
                i + 1,
                step + 1,
                avg_loss / print_frequency
            ))
            avg_loss = 0
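
`nn.CrossEntropyLoss` expects unnormalized scores (logits), because it applies log-softmax itself. The plain-Python sketch below (no PyTorch; `softmax` and `cross_entropy` are illustrative helpers, and the scores are made up) shows what happens to the loss when a softmax is applied before the criterion:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target):
    """Cross entropy for one example: -log(softmax(scores)[target])."""
    return -math.log(softmax(scores)[target])

logits = [8.0, 0.0, 0.0]   # a confident, correct prediction for class 0
loss_from_logits = cross_entropy(logits, target=0)
loss_from_probs = cross_entropy(softmax(logits), target=0)   # softmax applied twice

print(round(loss_from_logits, 4), round(loss_from_probs, 4))
```

Even for a confident, correct prediction, the second loss stays well above zero: inputs squeezed into [0, 1] can never produce a sharp distribution after the second softmax, so the loss has little room to fall.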

Use the loop above to train the model for at least 1 epoch.

1. Modify the loop to keep track of the average loss before it's reset. Then, plot the losses using matplotlib below.

In [None]:
'''
Your code here.

The average loss is reset after print_frequency iterations. Before it's set to 0, store it in a list that will persist throughout training.

10 pts.

'''
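
One way to structure the bookkeeping is sketched below. It mirrors the accumulate-then-reset pattern of the training loop on made-up loss values; `running_averages` and `plot_losses` are hypothetical helpers, not part of the provided code.

```python
def running_averages(step_losses, print_frequency):
    """Average consecutive windows of print_frequency losses, storing each average before the reset."""
    averages = []
    window_total = 0.0
    for step, loss in enumerate(step_losses):
        window_total += loss
        if (step + 1) % print_frequency == 0:
            averages.append(window_total / print_frequency)   # store before the reset
            window_total = 0.0
    return averages

def plot_losses(averages, print_frequency):
    """Plot the stored averages; matplotlib is imported lazily so the helper above has no dependencies."""
    import matplotlib.pyplot as plt
    steps = [(k + 1) * print_frequency for k in range(len(averages))]
    plt.plot(steps, averages)
    plt.xlabel('training step')
    plt.ylabel('average loss')
    plt.show()

fake_losses = [4.0, 2.0, 3.0, 1.0, 2.0]   # made-up per-step losses
print(running_averages(fake_losses, print_frequency=2))  # [3.0, 2.0]
```

In the real loop the list of averages would live outside the epoch loop so that it persists across epochs, and `plot_losses(averages, print_frequency)` would be called once training has finished.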

Now we'll modify our model and dataset to create a trigram language model. Here, the input will be two words rather than one. The output will remain the same.

Hint: since we have two inputs, we'll want to combine them in some way after we get their embeddings. An easy way to do this is to concatenate the two embeddings, creating a new vector of size 2*embedding_dim. This will be the input size of the first hidden layer.

In [None]:
class TrigramDataset(torch.utils.data.Dataset):

    def __init__(self, tokenized_data):

        raise NotImplementedError

    def __getitem__(self, idx):

        raise NotImplementedError

    def __len__(self):

        raise NotImplementedError
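
As with bigrams, the extraction logic can be prototyped on one sequence. The sketch below assumes the unknown-token filter carries over from the bigram dataset and packs each example as `((first, second), target)`. That layout is one possible choice, and `extract_trigrams` is a hypothetical helper.

```python
def extract_trigrams(token_ids, unk_id):
    """Return ((first, second), target) examples, skipping any trigram that contains unk_id."""
    trigrams = []
    for first, second, target in zip(token_ids, token_ids[1:], token_ids[2:]):
        if unk_id not in (first, second, target):
            trigrams.append(((first, second), target))
    return trigrams

print(extract_trigrams([0, 3, 4, 5, 2, 6, 7, 1], unk_id=2))
# [((0, 3), 4), ((3, 4), 5), ((6, 7), 1)]
```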

In [None]:
class TrigramLM(nn.Module):

    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_hidden_layers):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)


        self.hidden_layer_1 = nn.Linear(embedding_dim, hidden_dim)  #Embedding_dim will have to be modified
        self.hidden_layers = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_hidden_layers - 1)]
        )
        self.output_layer = nn.Linear(hidden_dim, vocab_size)

        self.relu = nn.ReLU()

    def forward(self, input_1, input_2):
        # Hint: we'll need to get an embedding for our second input somehow
        # embedding_1 = self.embedding(input_1)
        # embedding_2 =

        # Hint: this might be one way to combine our embeddings.
        # Use a local variable: assigning to self.embedding would overwrite the embedding layer.
        # combined = torch.cat()

        raise NotImplementedError
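
The concatenation hint can be checked in isolation by looking at tensor shapes. This is a sketch with toy sizes, not the finished model; the variable names and the hidden size of 16 are assumptions for illustration.

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 10, 4   # toy sizes for illustration

embedding = nn.Embedding(vocab_size, embedding_dim)
first_ids = torch.tensor([1, 2, 3])    # first context word of each example
second_ids = torch.tensor([4, 5, 6])   # second context word of each example

# Look up both words in the same embedding table, then join along the feature axis
combined = torch.cat([embedding(first_ids), embedding(second_ids)], dim=-1)
print(combined.shape)   # torch.Size([3, 8])

# The first hidden layer must therefore accept 2 * embedding_dim features
hidden_layer_1 = nn.Linear(2 * embedding_dim, 16)
hidden = hidden_layer_1(combined)
print(hidden.shape)     # torch.Size([3, 16])
```

Sharing one embedding table between both positions keeps the parameter count down; the position of each word is still visible to the network because each occupies its own half of the concatenated vector.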

In [None]:

def train_trigram(trigram_model, trigram_dataset):

    criteria = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(trigram_model.parameters())
    # optimizer = torch.optim.SGD(trigram_model.parameters(), lr=0.5)

    softmax = nn.Softmax(dim=2)

    epochs = 3
    batch_size = 32
    print_frequency = 1000


    train_dataloader = torch.utils.data.DataLoader(trigram_dataset, batch_size=batch_size, shuffle=True)

    for i in range(epochs):
        print('### Epoch: ' + str(i+1) + ' ###')

        trigram_model.train()
        avg_loss = 0

        for step, data in enumerate(train_dataloader):

            x, y = data

            x = x.unsqueeze(1)

            optimizer.zero_grad()

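            # TrigramLM.forward takes two inputs, so adapt this call to how your dataset packs the two context words
            model_output = trigram_model(x)

            # nn.CrossEntropyLoss applies log-softmax internally, so it is given the raw logits
            loss = criteria(model_output.squeeze(1), y)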

            loss.backward()
            optimizer.step()

            avg_loss += loss.item()
            if (step + 1) % print_frequency == 0:   # a full window of print_frequency losses has accumulated
                print('epoch: {} batch: {} loss: {}'.format(
                    i + 1,
                    step + 1,
                    avg_loss / print_frequency
                ))
                avg_loss = 0

In [None]:
"""
Complete the following code block by initializing the model/dataset and training for at least one epoch
Hint: the models and dataset should be _extremely_ similar to the bigram model and dataset

20 pts.
"""

trigram_model = TrigramLM()
trigram_dataset = TrigramDataset()


train_trigram(trigram_model, trigram_dataset)

To complete this section, complete the Trigram model and dataset, and train the model for at least 1 epoch.

## Section 2: Sentiment Analysis

In this section we'll see how a neural model similar to the one above performs on sentiment analysis. Then, we'll replace the embeddings with pretrained ones to see if that improves performance. To make our life easier, we'll use the GloVe vocabulary for both models.

You can download the embeddings from here: https://nlp.stanford.edu/projects/glove/

The GloVe vectors are distributed as a text file, with the word in the first column and the embedding values in the remaining columns. We'll read in the embeddings here.

In [None]:
glove_file = 'glove.6B.50d.txt'

embeddings_dict = {}

with open(glove_file, 'r', encoding='utf8') as f:
    for i, line in enumerate(f):
        if i == 0:
            print(line)
        line = line.strip().split(' ')
        word = line[0]
        embed = numpy.asarray(line[1:], "float")

        embeddings_dict[word] = embed

print('Loaded {} words from glove'.format(len(embeddings_dict)))

embedding_matrix = numpy.zeros((len(embeddings_dict)+1, 50)) #add 1 for padding

word2id = {}
for i, word in enumerate(embeddings_dict.keys()):

    word2id[word] = i                                #Map each word to an index
    embedding_matrix[i] = embeddings_dict[word]      #That index holds the Glove embedding in the embedding matrix

# Our joint vocabulary for both models / sanity check to see if we've loaded it correctly:
print(word2id['the'])
print(embedding_matrix[word2id['the']])

word2id['<pad>'] = embedding_matrix.shape[0] - 1
print(embedding_matrix[word2id['<pad>']])
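
To see the file format and the matrix layout without downloading anything, the same steps can be run on a couple of made-up lines. The words, values, and the 3-dimensional size below are invented for illustration (the file used above has 50 dimensions), and the `toy_` names are hypothetical.

```python
import numpy

# Two made-up lines in the GloVe text format: word, then space-separated values
sample_lines = [
    "the 0.1 -0.2 0.3",
    "movie 0.4 0.5 -0.6",
]

toy_embeddings = {}
for line in sample_lines:
    word, *values = line.strip().split(' ')
    toy_embeddings[word] = numpy.asarray(values, dtype=float)

# One row per word, plus a final all-zero row reserved for padding
toy_matrix = numpy.zeros((len(toy_embeddings) + 1, 3))
toy_word2id = {}
for i, (word, vector) in enumerate(toy_embeddings.items()):
    toy_word2id[word] = i
    toy_matrix[i] = vector
toy_word2id['<pad>'] = toy_matrix.shape[0] - 1

print(toy_word2id)                       # {'the': 0, 'movie': 1, '<pad>': 2}
print(toy_matrix[toy_word2id['<pad>']])  # the padding row is all zeros
```

Because the padding row is all zeros, padded positions contribute nothing when token embeddings are summed.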


We'll create another dataset for our (now labeled) movie reviews. Do not change the max_length values.

In [None]:
# Create a classification dataset for the movie reviews


class MovieReviewDataset(torch.utils.data.Dataset):

    def __init__(self, directory=None, split=None, word2id=None, finalized_data=None, data_limit=250, max_length=256):
        """
        :param directory: The location of aclImdb
        :param split: Train or test
        :param word2id: The generated glove word2id dictionary
        :param finalized_data: We'll use this to initialize a validation set without reloading the data.
        :param data_limit: Limiter on the number of examples we load
        :param max_length: Maximum length of the sequence
        """

        self.data_limit = data_limit
        self.max_length = max_length
        self.word2id = word2id

        if finalized_data:
            self.data = finalized_data

        else:

            pos_dir = directory + '{}/pos/'.format(split)
            neg_dir = directory + '{}/neg/'.format(split)

            pos_examples = self.read_folder(pos_dir)
            neg_examples = self.read_folder(neg_dir)

            pos_examples_tokenized = [(ids, 1) for ids in self.tokenize(pos_examples)]
            neg_examples_tokenized = [(ids, 0) for ids in self.tokenize(neg_examples)]

            self.data = pos_examples_tokenized + neg_examples_tokenized

            random.shuffle(self.data)

    def read_folder(self, folder):
        examples = []
        for fname in os.listdir(folder)[:self.data_limit]:
            with open(os.path.join(folder, fname), encoding='utf8') as f:
                examples.append(f.readline().strip())
        return examples

    def tokenize(self, examples):

        example_ids = []
        misses = 0              # Count the number of tokens in our dataset which are not covered by glove -- i.e. percentage of unk tokens
        total = 0
        for example in examples:
            tokens = word_tokenize(example)
            ids = []
            for tok in tokens:
                if tok in self.word2id:
                    ids.append(self.word2id[tok])
                else:
                    misses += 1
                    ids.append(self.word2id['unk'])
                total += 1

            if len(ids) >= self.max_length:
                ids = ids[:self.max_length]
            else:
                ids = ids + [self.word2id['<pad>']]*(self.max_length - len(ids))
            example_ids.append(torch.tensor(ids))
        print('Missed {} out of {} words -- {:.2f}%'.format(misses, total, 100 * misses / total))
        return example_ids

    def generate_validation_split(self, ratio=0.8):

        split_idx = int(ratio * len(self.data))

        # Take a chunk of the processed data, and return it in order to initialize a validation dataset
        validation_split = self.data[split_idx:]

        #We'll remove this data from the training data to prevent leakage
        self.data = self.data[:split_idx]

        return validation_split


    def __getitem__(self, item):
        return self.data[item]

    def __len__(self):
        return len(self.data)
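
Every review is forced to exactly `max_length` ids so that the dataloader can stack examples into a single batch tensor. The truncate-or-pad step on its own looks like this (`pad_or_truncate` is a hypothetical helper and the ids are made up):

```python
def pad_or_truncate(ids, max_length, pad_id):
    """Return a list of exactly max_length ids."""
    if len(ids) >= max_length:
        return ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

print(pad_or_truncate([7, 8, 9], max_length=5, pad_id=0))              # [7, 8, 9, 0, 0]
print(pad_or_truncate([7, 8, 9, 10, 11, 12], max_length=5, pad_id=0))  # [7, 8, 9, 10, 11]
```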


We'll define our two models: the randomly initialized RandomModel and the GloveModel where we use the pretrained vectors.

In [None]:
# Define a simple classification model
class RandomModel(nn.Module):

    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_hidden_layers, max_length=256):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.hidden_layer_1 = nn.Linear(embedding_dim, hidden_dim)
        self.hidden_layers = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_hidden_layers - 1)]
        )


        self.output_layer = nn.Linear(hidden_dim, 1)

        self.relu = nn.ReLU()

    def forward(self, input):

        embedding = self.embedding(input).squeeze(1)
        embedding = torch.sum(embedding, dim=1)

        hidden = self.relu(self.hidden_layer_1(embedding))
        for layer in self.hidden_layers:
            hidden = self.relu(layer(hidden))

        output = self.output_layer(hidden)
        return output

# Define a Glove classification model
class GloveModel(nn.Module):

    def __init__(self, pretrained_embedding, hidden_dim, num_hidden_layers, max_length=256):
        super().__init__()

        self.embedding = nn.Embedding.from_pretrained(torch.FloatTensor(pretrained_embedding))
        self.hidden_layer_1 = nn.Linear(pretrained_embedding.shape[1], hidden_dim)  # forward() sums over the sequence, so the input size is just the embedding dimension
        self.hidden_layers = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_hidden_layers - 1)]
        )
        self.output_layer = nn.Linear(hidden_dim, 1)

        self.relu = nn.ReLU()

    def forward(self, input):

        embedding = self.embedding(input).squeeze(1)
        embedding = torch.sum(embedding, dim=1)

        hidden = self.relu(self.hidden_layer_1(embedding))
        for layer in self.hidden_layers:
            hidden = self.relu(layer(hidden))

        output = self.output_layer(hidden)

        return output

Here we'll define a new prediction method. It applies a sigmoid to the model's output and classifies the example as 0 if the result is below the threshold (0.5), or 1 otherwise.

We'll use this method to log our validation accuracy as we train.

In [None]:
def predict(model, valid_dataloader):

    sigmoid = nn.Sigmoid()

    total_correct = 0
    total_examples = len(valid_dataloader)   # number of batches; equals the number of examples only because batch_size=1

    for x,y in valid_dataloader:

        x = x.unsqueeze(1)
        output = sigmoid(model(x))

        if (output < 0.5 and y == 0) or (output >= 0.5 and y == 1):
            total_correct += 1

    accuracy = total_correct / total_examples
    print('accuracy: {}'.format(accuracy))
    return accuracy
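
The thresholding rule can be checked by hand on a few made-up logits. The sketch below uses plain Python instead of tensors; `sigmoid`, `classify`, and `accuracy` are illustrative helpers.

```python
import math

def sigmoid(logit):
    return 1.0 / (1.0 + math.exp(-logit))

def classify(logit, threshold=0.5):
    """Return 1 if sigmoid(logit) reaches the threshold, else 0."""
    return 1 if sigmoid(logit) >= threshold else 0

def accuracy(logits, labels):
    correct = sum(classify(z) == y for z, y in zip(logits, labels))
    return correct / len(labels)

logits = [2.3, -0.7, 0.0, -4.1]   # made-up model outputs
labels = [1, 0, 0, 0]
predictions = [classify(z) for z in logits]
print(predictions)                # [1, 0, 1, 0]
print(accuracy(logits, labels))   # 0.75
```

Since sigmoid(0) is exactly 0.5, thresholding the probability at 0.5 is the same as checking whether the raw logit is non-negative.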

Finally, we'll define the training loop for these models.

In [None]:
def train_classification(model, train_dataset, valid_dataset, epochs=100, batch_size=32, print_frequency=100):

    criteria = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.AdamW(model.parameters())

    train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    valid_dataloader = torch.utils.data.DataLoader(valid_dataset, batch_size=1, shuffle=False)

    for i in range(epochs):
        print('### Epoch: ' + str(i+1) + ' ###')

        model.train()
        avg_loss = 0

        for step, data in enumerate(train_dataloader):

            x, y = data
            x = x.unsqueeze(1)

            optimizer.zero_grad()

            model_output = model(x)

            loss = criteria(model_output.squeeze(1), y.float())

            loss.backward()
            optimizer.step()

            avg_loss += loss.item()
            if (step + 1) % print_frequency == 0:   # a full window of print_frequency losses has accumulated
                print('epoch: {} batch: {} loss: {}'.format(
                    i + 1,
                    step + 1,
                    avg_loss / print_frequency
                ))
                avg_loss = 0

        model.eval()
        with torch.no_grad():
            predict(model, valid_dataloader)

Initialize the training and validation datasets. The dataloaders themselves are created inside train_classification().

In [None]:
train_dataset = MovieReviewDataset('../Homework 1/aclImdb/', 'train', word2id)
validation_examples = train_dataset.generate_validation_split()
print('Loaded {} train examples'.format(len(train_dataset)))

valid_dataset = MovieReviewDataset(finalized_data=validation_examples, word2id=word2id)
print('Loaded {} validation examples'.format(len(valid_dataset)))

In the following two code blocks, initialize a new RandomModel in one and a new GloveModel in the other, and use train_classification() to train them. For each model, find a set of model parameters (i.e. experiment with the number of hidden layers and the hidden layer size) and training parameters (epochs, batch size) that gives you a good (>70%) validation accuracy.

Some tips:
    1. Given your resources, first decide how many data examples to load. This is controlled by the data_limit value of the dataset.
    2. For previous models, we've only trained for 1-3 epochs due to the large number of parameters when language modeling. You may need to train for considerably longer (30-50 epochs or more) to get results.
    3. Performance is a function of both training time and the model itself. Keep an eye on the validation accuracy in case the model is overfitting (loading more examples can help prevent this).
    4. Right now, every hidden layer has the same dimension. Consider widening or narrowing some layers.

Additionally, modify the training loop to collect validation set accuracies after each epoch (the predict method is already returning these values). For each model, plot the training loss and validation accuracy over time.

In [None]:
'''

Initialize the RandomModel here.


Your code here

10 pts.
'''

random_model = RandomModel()
train_classification(random_model, train_dataset, valid_dataset)

In [None]:
'''

Initialize the GloveModel here.


Your code here

10 pts.
'''

glove_model = GloveModel()
train_classification(glove_model, train_dataset, valid_dataset)

Once you've finished tuning parameters, test the two models on the test set.

In [None]:
test_dataset = MovieReviewDataset('../Homework 1/aclImdb/', 'test', word2id)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False)

print('Random model accuracy: ')
predict(random_model, test_dataloader)

print('Glove model accuracy: ')
predict(glove_model, test_dataloader)

## Free Response Questions (20 pts.):
1. Compare the performance of the Glove model vs the Random model. Refer to the validation accuracy curves and the test set results in your answer.
2. Compare the training loop between the supervised and unsupervised models. What's different (outside of code features like predicting validation accuracy after each epoch)?

_Your answer here._