
In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Homework 1

- Deadline: 11:59 pm, Monday, July 13th, 2020
- Name: Robert Mulligan

In this programming assignment, you will work on three tasks: (1) sentiment analysis using logistic regression with bag of words, (2) sentiment analysis with word embeddings, and (3) language modeling with RNNs. To help you get started quickly, most of the required code has already been provided. Your primary task is to understand the provided code and fill in the gaps.

You will use PyTorch extensively.  We recommend reading the first three tutorials on "Deep Learning for NLP with PyTorch" by Robert Guthrie https://pytorch.org/tutorials/beginner/deep_learning_nlp_tutorial.html for this assignment.  Other tutorials are also useful, and they are mentioned later in this notebook.

You should also seek our help unhesitatingly.  We want you to learn a lot of material in a short period of time, and our help will make it easier.  Please post your questions on Piazza or meet with us during office hours.

### Sentiment Analysis

For sentiment analysis, you will work on the Large Movie Review Dataset (http://ai.stanford.edu/~amaas/data/sentiment/). The task is to classify movie reviews into two categories, POSITIVE or NEGATIVE. 

You are provided with a training set (TRAIN), a development set (DEV), and a test set (TEST). Your classifier is trained on TRAIN, evaluated and tuned on DEV, and tested on TEST. 

You will build two classifiers in this homework, a logistic regression classifier with bag of words features and a neural network classifier with word embeddings.  For the logistic regression classifier with bag of words, we will preprocess the data from scratch. For the neural network based classifier with word embeddings, we will use torchtext for preprocessing the data. 

### Language modeling

The last problem is on language modeling. It may take a long time on a CPU, so you can try running it on Google Colab. Go to https://colab.research.google.com and log in using your Google account. To upload a Python notebook, click on the "File" dropdown menu and then "Upload notebook". To use a GPU, click the “Runtime” dropdown menu, select “Change runtime type”, select python2 or 3 from the “Runtime type” dropdown menu, and choose GPU as the hardware accelerator. You can find detailed instructions on how to use Google Colab on this webpage https://www.geeksforgeeks.org/how-to-use-google-colab/

In [None]:
import torch
import torch.utils.data as tud
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from collections import Counter, defaultdict
import operator
import os, math
import numpy as np
import random
import copy
# from nltk import word_tokenize

def word_tokenize(s):
    return s.split()

# set the random seeds so the experiments can be replicated exactly
random.seed(53113)
np.random.seed(53113)
torch.manual_seed(53113)
if torch.cuda.is_available():
    torch.cuda.manual_seed(53113)

# Global class labels.
POS_LABEL = 'pos'
NEG_LABEL = 'neg'     

In [None]:
def load_data(data_file):
    data = []
    with open(data_file,'r', encoding= "Latin-1") as fin:
        for line in fin:
            label, content = line.split(",", 1)
            data.append((content.lower(), label))
    return data
data_dir = "/content/drive/My Drive/large_movie_review_dataset" # adjusted for Google Drive (mounted)
train_data = load_data(os.path.join(data_dir, "train.txt"))
dev_data = load_data(os.path.join(data_dir, "dev.txt"))

def load_test_data(data_file):
    data = []
    with open(data_file,'r', encoding= "Latin-1") as fin:
        for line in fin:
            data.append(line.strip())
    return data

test_data = load_test_data(os.path.join(data_dir, "test.txt"))

In [None]:
print("number of TRAIN data", len(train_data))
print("number of DEV data", len(dev_data))

We define a generic model class below. The model has two methods, train_model and classify. 

In [None]:
VOCAB_SIZE = 5000
class Model:
    def __init__(self, data):
        # Vocabulary keeps the VOCAB_SIZE-1 most frequent words in the training data; all other words map to UNK
        self.vocab = Counter([word for content, label in data for word in word_tokenize(content)]).most_common(VOCAB_SIZE-1) 
        self.word_to_idx = {k[0]: v+1 for v, k in enumerate(self.vocab)} # word to index mapping
        self.word_to_idx["UNK"] = 0 # all the unknown words will be mapped to index 0
        self.idx_to_word = {v:k for k, v in self.word_to_idx.items()}
        self.label_to_idx = {POS_LABEL: 0, NEG_LABEL: 1}
        self.idx_to_label = [POS_LABEL, NEG_LABEL]
        self.vocab = set(self.word_to_idx.keys())
        
    def train_model(self, data):
        '''
        Train the model with the provided training data
        '''
        raise NotImplementedError
        
    def classify(self, data):
        '''
        classify the documents with the model
        '''
        raise NotImplementedError

## Sentiment Analysis with Logistic Regression and Bag of Words

You will implement logistic regression with bag of words features in the following. 
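
As a rough illustration of what a bag-of-words feature vector looks like, the sketch below counts tokens against a toy vocabulary in plain Python. The vocabulary `toy_word_to_idx` and the helper `bow_vector` are made up for this example; index 0 plays the role of `UNK`, as in the `Model` class above.

```python
def bow_vector(text, word_to_idx, vocab_size):
    """Count how often each vocabulary word occurs in `text`."""
    counts = [0] * vocab_size
    for word in text.split():
        # Words outside the vocabulary fall back to index 0 (UNK).
        counts[word_to_idx.get(word, 0)] += 1
    return counts

# Hypothetical four-word vocabulary, for illustration only.
toy_word_to_idx = {"UNK": 0, "good": 1, "bad": 2, "movie": 3}
vec = bow_vector("good good movie overall", toy_word_to_idx, 4)
print(vec)  # [1, 2, 0, 1]
```

Word order is discarded: only the counts reach the linear layer, which is why this representation is called a "bag" of words.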

In [None]:
class TextClassificationDataset(tud.Dataset):
    '''
    PyTorch provides a common dataset interface. 
    https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
    The dataset encodes documents into indices. 
    With the PyTorch dataloader, you can easily get batched data for training and evaluation. 
    '''
    def __init__(self, word_to_idx, data):
        
        self.data = data
        self.word_to_idx = word_to_idx
        self.label_to_idx = {POS_LABEL: 0, NEG_LABEL: 1}
        self.vocab_size = VOCAB_SIZE
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        item = np.zeros(self.vocab_size)
        
        item = torch.from_numpy(item)
        if len(self.data[idx]) == 2: # in training or evaluation, we have both the document and label
            for word in word_tokenize(self.data[idx][0]):
                item[self.word_to_idx.get(word, 0)] += 1
            label = self.label_to_idx[self.data[idx][1]]
            return item, label
        else: # in testing, we only have the document without label
            for word in word_tokenize(self.data[idx]):
                item[self.word_to_idx.get(word, 0)] += 1
            return item

In [None]:
best_model = None
class BoWLRClassifier(nn.Module, Model):
    '''
    Define your logistic regression model with bag of words features.
    '''
    def __init__(self, data):
        nn.Module.__init__(self)
        Model.__init__(self, data)
        
        '''
        In this model initialization phase, you will do the following: 
        1. Define a linear layer to transform bag of words features into 2 classes. 
        2. Define the loss function, you will use cross entropy loss
            https://pytorch.org/docs/stable/nn.html?highlight=crossen#torch.nn.CrossEntropyLoss
        3. Define an optimizer for the model, you may choose to use SGD, Adam or other optimizers you know
            https://pytorch.org/docs/stable/optim.html?highlight=sgd#torch.optim.SGD
        '''
        # added
        self.linear = nn.Linear(VOCAB_SIZE, 2) 
        self.loss_fn = nn.CrossEntropyLoss() 
        self.optimizer = torch.optim.SGD(params=self.parameters(), lr=0.001) 
        
        
    def forward(self, bow):
        '''
        Run the model. You may only need to run the linear layer defined in the init function. 
        '''
        out = self.linear(bow) 
        return out 
    
    def train_epoch(self, train_data):
        '''
        Train the model for one epoch with the training data
        When training a model, you will repeat the following procedures:
        1. get one batch of features and labels
        2. make a forward pass with the features to get predictions
        3. calculate the loss with the predictions and target labels
        4. run a backward pass from the loss function to get the gradients
        5. apply the optimizer step to update the model parameters
        '''
        dataset = TextClassificationDataset(self.word_to_idx, train_data)
        dataloader = tud.DataLoader(dataset, batch_size=8, shuffle=True)
        self.train()
        for i, (X, y) in enumerate(dataloader):
            X = X.float()
            y = y.long()
            if torch.cuda.is_available():
                X = X.cuda()
                y = y.cuda()
            self.optimizer.zero_grad()
            preds = self.forward(X)
            loss = self.loss_fn(preds, y)
            loss.backward()
            if i % 500 == 0:
                print("loss: {}".format(loss.item()))
            self.optimizer.step()
    
    def train_model(self, train_data, dev_data):
        """
        This function processes the entire training set for multiple epochs.
        After each training epoch, you will evaluate your model on the DEV set. 
        The best performing model on the DEV set shall be saved to best_model
        """  
        global best_model  # without this, the assignment below would only create a local variable
        dev_accs = [0.]
        for epoch in range(2): # training for more epochs did not noticeably help
            self.train_epoch(train_data)
            dev_acc = self.evaluate(dev_data)
            print("dev acc: {}".format(dev_acc))
            if dev_acc > max(dev_accs):
                best_model = copy.deepcopy(self)
            dev_accs.append(dev_acc)

    def classify(self, docs):
        '''
        This function classifies documents into their categories. 
        docs are documents only, without labels.
        '''
        dataset = TextClassificationDataset(self.word_to_idx, docs)
        dataloader = tud.DataLoader(dataset, batch_size=1, shuffle=False)
        results = []
        with torch.no_grad():
            for i, X in enumerate(dataloader):
                X = X.float()
                if torch.cuda.is_available():
                    X = X.cuda()
                preds = self.forward(X)
                results.append(preds.max(1)[1].cpu().numpy().reshape(-1))
        results = np.concatenate(results)
        results = [self.idx_to_label[p] for p in results]
        return results
                
    def evaluate(self, data):
        '''
        This function evaluates the current model on the data. 
        data contains documents and labels. 
        It calls the function "classify" to make predictions, 
        and compares them with the correct labels to return the model accuracy on "data". 
        '''
        self.eval()
        preds = self.classify([d[0] for d in data])
        targets = [d[1] for d in data]
        correct = 0.
        total = 0.
        for p, t in zip(preds, targets):
            if p == t: 
                correct += 1
            total += 1
        return correct/total
        

In [None]:
lr_model = BoWLRClassifier(train_data)
if torch.cuda.is_available():
    lr_model = lr_model.cuda()
lr_model.train_model(train_data, dev_data)

Now spend some time to tune your models. At least try the following: 

- try another optimizer
- change the learning rate
- change the number of epochs to train

Report your results and analysis in the writeup. 



In [None]:
# Increasing the number of epochs and the learning rate gave no significant improvement,
# but switching to the Adam optimizer resulted in a 10% accuracy improvement
lr_model.optimizer = optim.Adam(params=lr_model.parameters(), lr=0.001)
lr_model.train_model(train_data, dev_data)

In [None]:
preds = lr_model.classify(test_data)
def write_to_file(preds, filename):
    i = 0
    with open(os.path.join(filename), "w") as fout:
        fout.write("index,label\n")
        for pred in preds:
            fout.write("{},{}\n".format(i, pred))
            i += 1

write_to_file(preds, "lr_test_preds.txt")

In [None]:
# Extract the weights from the linear layer 
weights = lr_model.linear.weight
model_dataset = lr_model.word_to_idx

Identify the top 10 features with the maximum weights for POSITIVE category. Explain your findings. 

In [None]:
pos_values, pos_indices = torch.topk(input=weights[0], k=10, largest=True) # Positive

# topk returns indices sorted by weight, so iterate over them directly to print the words in rank order
for i, (val, indx) in enumerate(zip(pos_values.tolist(), pos_indices.tolist()), start=1):
  print(f"#{i} \"{lr_model.idx_to_word[indx]}\" at index {indx} (weight {val:.4f})")

These are the words with the largest weights for the POSITIVE category. They make sense as indicators of a good rating.
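
To make the ranking step concrete, here is a small sketch in plain Python that picks the k largest (or smallest) weights and maps their indices back to words. The weight values and the `toy_idx_to_word` mapping are invented for illustration and are not taken from the trained model.

```python
def top_k_features(weights, idx_to_word, k, largest=True):
    """Return (word, weight) pairs for the k largest (or smallest) weights."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=largest)
    return [(idx_to_word[i], weights[i]) for i in order[:k]]

# Hypothetical weights for one class, for illustration only.
toy_idx_to_word = {0: "UNK", 1: "great", 2: "awful", 3: "the", 4: "wonderful"}
toy_weights = [0.01, 0.90, -0.80, 0.00, 0.75]

print(top_k_features(toy_weights, toy_idx_to_word, 2))
# [('great', 0.9), ('wonderful', 0.75)]
print(top_k_features(toy_weights, toy_idx_to_word, 1, largest=False))
# [('awful', -0.8)]
```

Iterating over the sorted indices, rather than over the vocabulary dictionary, keeps the printed order identical to the rank order.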

Identify the top 10 features with the maximum negative weights for POSITIVE category. Explain your findings. 

In [None]:
pos_values, pos_indices = torch.topk(input=weights[0], k=10, largest=False) # Positive smallest

# topk returns indices sorted by weight, so iterate over them directly to print the words in rank order
for i, (val, indx) in enumerate(zip(pos_values.tolist(), pos_indices.tolist()), start=1):
  print(f"#{i} \"{lr_model.idx_to_word[indx]}\" at index {indx} (weight {val:.4f})")

Identify the top 10 features with the maximum positive weights for NEGATIVE category. Explain your findings. 

In [None]:
neg_values, neg_indices = torch.topk(input=weights[1], k=10, largest=True) # Negative 

# topk returns indices sorted by weight, so iterate over them directly to print the words in rank order
for i, (val, indx) in enumerate(zip(neg_values.tolist(), neg_indices.tolist()), start=1):
  print(f"#{i} \"{lr_model.idx_to_word[indx]}\" at index {indx} (weight {val:.4f})")

Identify the top 10 features with the maximum negative weights for NEGATIVE category. Explain your findings. 

In [None]:
neg_values, neg_indices = torch.topk(input=weights[1], k=10, largest=False) # Negative smallest

# topk returns indices sorted by weight, so iterate over them directly to print the words in rank order
for i, (val, indx) in enumerate(zip(neg_values.tolist(), neg_indices.tolist()), start=1):
  print(f"#{i} \"{lr_model.idx_to_word[indx]}\" at index {indx} (weight {val:.4f})")

## Sentiment Analysis with Word-Embeddings and a Neural Network

We will use [torchtext](https://github.com/pytorch/text) to create vocabulary, and load datasets into batches. Please refer to their GitHub README page for a quick tutorial.  More details are in the set of tutorials at https://github.com/bentrevett/pytorch-sentiment-analysis

In [None]:
import torch
from torchtext import data
SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
# We are using 'spacy' tokenizer. You can also write your own tokenizer. You can download spacy from
# this site https://spacy.io/usage
TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.float)

We now download the IMDB dataset using torchtext datasets. This step may take 10 to 15 minutes.

In [None]:
from torchtext import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
# Seed Python's RNG explicitly, then pass its state so the split is reproducible
random.seed(SEED)
train_data, valid_data = train_data.split(random_state = random.getstate())

In [None]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')
print(f'Number of validation examples: {len(valid_data)}')

Now we build our vocabulary/dictionary. We keep only the 25,000 most common words.

In [None]:
MAX_VOCAB_SIZE = 25000

TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

One can inspect the vocabulary and its indices directly through either the stoi (string to int) or itos (int to string) attribute.

In [None]:
print(TEXT.vocab.itos[:10])
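
The vocabulary-building step can be sketched without torchtext. The example below keeps the most frequent tokens and builds the two lookups by hand; `build_vocab`, the toy corpus, and the special-token names are illustrative assumptions, not torchtext's actual implementation.

```python
from collections import Counter

def build_vocab(tokenized_docs, max_size, specials=("<unk>", "<pad>")):
    """Keep the `max_size` most frequent tokens, placed after the special tokens."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

# Hypothetical three-document corpus, for illustration only.
docs = [["a", "good", "movie"], ["a", "bad", "movie"], ["a", "movie"]]
itos, stoi = build_vocab(docs, max_size=2)
print(itos)                 # ['<unk>', '<pad>', 'a', 'movie']
print(stoi.get("good", 0))  # 0, the index used here for unknown words
```

Here `itos` is a list indexed by integer and `stoi` is the inverse dictionary, which mirrors how the two lookups are used in this notebook.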

The next step is to build an iterator. The iterator returns one batch of data per iteration. It also pads the sentences in a batch so that they all have equal length.

In [None]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)
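
The padding behaviour described above can be illustrated in plain Python. `pad_batch` and the pad index of 1 are assumptions made for this sketch; torchtext handles padding internally when it builds a batch.

```python
def pad_batch(sequences, pad_idx):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]

# Three index sequences of different lengths (made-up indices).
batch = pad_batch([[5, 8, 2], [7], [4, 9]], pad_idx=1)
print(batch)  # [[5, 8, 2], [7, 1, 1], [4, 9, 1]]
```

Grouping sentences of similar length into the same batch reduces the amount of padding that is needed.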

The function binary_accuracy will be used to compute accuracy from logits

In [None]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc
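
As a dependency-free sketch of the same computation, the version below applies a sigmoid to each logit, thresholds at 0.5, and compares with the labels. The name `binary_accuracy_py` and the example numbers are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_accuracy_py(logits, labels):
    """Fraction of examples whose thresholded sigmoid(logit) equals the label."""
    preds = [1.0 if sigmoid(z) > 0.5 else 0.0 for z in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

acc = binary_accuracy_py([2.0, -1.0, 0.3, -0.2], [1.0, 0.0, 0.0, 0.0])
print(acc)  # 0.75
```

Since the sigmoid exceeds 0.5 exactly when the logit is positive, the prediction is determined by the sign of the logit.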

The class WordEmbAvg defines our model. The model works in the following way.

1. The input to our model is a batch of sentences. All sentences are padded to equal length.
   Every word in a sentence is represented by a one-hot encoding, so a sentence is a list of one-hot encodings.

2. The input passes through an embedding layer. The embedding layer converts the one-hot encoding of every word into a word vector.

3. We take the average of all the word vectors in a sentence. This vector is then used as the input to a neural network.

4. The neural network has only one output. It tells you the probability of the sentence having a particular label. (We will only have two labels)

Please read about embedding from  https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html.
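
The steps above can be sketched in plain Python with a tiny hand-written embedding table. All values below, and the helper names `average_embedding` and `linear`, are invented for illustration; the real model learns its embedding table and weights during training.

```python
def average_embedding(token_ids, embedding_table):
    """Look up each token's vector and average them component-wise."""
    vectors = [embedding_table[i] for i in token_ids]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def linear(x, weight, bias):
    """Single-output linear layer: a dot product plus a bias, giving one logit."""
    return sum(xi * wi for xi, wi in zip(x, weight)) + bias

# Hypothetical table: 3 words, 2 embedding dimensions.
toy_table = [[0.0, 0.0], [1.0, 3.0], [3.0, 1.0]]
avg = average_embedding([1, 2], toy_table)
logit = linear(avg, weight=[0.5, -0.5], bias=0.25)
print(avg, logit)  # [2.0, 2.0] 0.25
```

Looking up row `i` of the table gives the same result as multiplying the one-hot vector for word `i` by the embedding matrix, which is why an index lookup is enough in practice.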


#3.a
Let $v(\cdot)$ be the one-hot encoding function of a word, so that $$v(w_i) \in \{0, 1\}^{|V|}$$

Let $c$ be a sequence of $k$ words
$$c = w_{1:k} = w_1, \dots, w_k$$

Let $x$ be the list of one-hot encodings $[v(w_1), v(w_2), \dots, v(w_k)]$

Let $emb(\cdot)$ convert a one-hot encoding $v(w_i)$ into a word vector

So, $X = [emb(v(w_1)), emb(v(w_2)), \dots, emb(v(w_k))]$


$$WordEmbAvg = \frac{1}{k}\sum_{i=1}^{k} emb(v(w_i))$$

$$h = \sigma(W \cdot WordEmbAvg + bias)$$

Here $\sigma$ is the sigmoid function and $W$ is the weight vector of the linear layer; the sigmoid is used because the network has a single output.



In [None]:
class WordEmbAvg(nn.Module):
    def __init__(self, input_dim, embedding_dim, output_dim, pad_idx):
        
        super().__init__()
        
        # Define the embedding layer on the next line.
        # It should be something like 
        #emb = nn.Embedding(input_dim, embedding_dim, padding_idx=pad_idx)
        
        self.emb = nn.Embedding(input_dim, embedding_dim, padding_idx=pad_idx)
        
        # Define your neural network. It can be a single-layer or multi-layer neural network.
        # You don't need to apply a softmax in the output layer
        
        # single linear layer
        self.linear = nn.Linear(embedding_dim, output_dim) 
        
        
    def forward(self, text):

        
        #Input goes to the embedding layer
       
        output = self.emb(text)
        
        # Take the average of all word embeddings. Please check how to use mean() on a tensor in PyTorch
        
        output = torch.mean(output, 0)
        
        # Previous input now goes into the neural network
        
        output = self.linear(output)
         
        return output

In [None]:
class Training_module( ):

    def __init__(self, model):
       self.model = model
    
       #The loss function should be binary cross entropy with logits. 
       self.loss_fn = nn.BCEWithLogitsLoss()
       # Choose your favorite optimizer
       self.optimizer = optim.Adam(self.model.parameters(), lr=0.001)
    
    def train_epoch(self, iterator):
        '''
        Train the model for one epoch with the training data
        When training a model, you will repeat the following procedures:
        1. get one batch of features and labels
        2. make a forward pass with the features to get predictions
        3. calculate the loss with the predictions and target labels
        4. run a backward pass from the loss function to get the gradients
        5. apply the optimizer step to update the model parameters
        '''
        epoch_loss = 0
        epoch_acc = 0
    
    
        for batch in iterator:
        
            self.optimizer.zero_grad()
                
            # Help from posted Github example:
            # https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb
            prediction = self.model(batch.text).squeeze(1)
            loss = self.loss_fn(prediction, batch.label)
            acc = binary_accuracy(prediction, batch.label)
            loss.backward()
            self.optimizer.step()
            
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
        return epoch_loss / len(iterator), epoch_acc / len(iterator)
    
    def train_model(self, train_iterator, dev_iterator):
        """
        This function processes the entire training set for multiple epochs.
        After each training epoch, you will evaluate your model on the DEV set. 
        The best performing model on the DEV set shall be saved to best_model
        """  
        dev_accs = [0.]
        for epoch in range(15): # increased the number of epochs (originally 5)
            self.train_epoch(train_iterator)
            dev_acc = self.evaluate(dev_iterator)
            print("dev acc: {}".format(dev_acc[1]), "dev loss:{}".format(dev_acc[0]))
            if dev_acc[1] > max(dev_accs):
                best_model = copy.deepcopy(self)
            dev_accs.append(dev_acc[1])
        return best_model.model
                
    def evaluate(self, iterator):
        '''
        This function evaluates the current model on the data.
        1. make a forward pass with the features to get predictions
        2. calculate the loss with the predictions and target labels
        3. Use the binary accuracy function to compute the accuracy
        '''
        epoch_loss = 0
        epoch_acc = 0
    
        #model.eval()
    
        with torch.no_grad():
    
            for batch in iterator:

                predictions = self.model(batch.text).squeeze(1)
        
                loss = self.loss_fn(predictions, batch.label)
        
                acc = binary_accuracy(predictions, batch.label)
        
                epoch_loss += loss.item()
                epoch_acc += acc.item()
        
        return epoch_loss / len(iterator), epoch_acc / len(iterator)
   

In [None]:
INPUT_DIM = len(TEXT.vocab)
#You can try many different embedding dimensions. Common values are 20, 32, 64, 100, 128, 512
EMBEDDING_DIM = 512
OUTPUT_DIM = 1
#Get the index of the pad token using the stoi function

# From Github
# https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/5%20-%20Multi-class%20Sentiment%20Analysis.ipynb
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]


model = WordEmbAvg(INPUT_DIM, EMBEDDING_DIM, OUTPUT_DIM, PAD_IDX)


In [None]:
model = model.to(device)
tm =Training_module(model)

#Train the model
best_model = tm.train_model(train_iterator, valid_iterator)

In [None]:
tm.model = best_model
test_loss, test_acc = tm.evaluate(test_iterator)
#Accuracy of the best model on the test data. It should be possible to reach around 80-85%
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Now spend some time to tune your models. At least try the following: 

- try another optimizer
- change the number of epochs to train
- Add a dropout layer to the embedding layer
- Try different embedding sizes

Report your results and analysis in the writeup. 




Compute squared norms of the word vectors. List 10 words with highest norm and 10 with lowest norms. Explain your findings.

In [None]:
# Squared L2 norm of every word vector (each row of the embedding matrix is one word)
embs = tm.model.emb
sq_norms = (embs.weight.detach() ** 2).sum(dim=1)

# Highest
word_values, word_indices = torch.topk(input=sq_norms, k=10, largest=True) 

words = word_indices.tolist()
print("\nHighest Norm Words")
print(f"Indices: {words}")

for i, word in enumerate(words, start=1): 
  print(f'#{i} {TEXT.vocab.itos[word]}')

# Lowest
word_values, word_indices = torch.topk(input=sq_norms, k=10, largest=False)  

words = word_indices.tolist()
print("\nLowest Norm Words")
print(f"Indices: {words}")

for i, word in enumerate(words, start=1): 
  print(f'#{i} {TEXT.vocab.itos[word]}')
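
For reference, the squared norm of a word vector is the sum of its squared components. The sketch below ranks a few made-up vectors by that quantity in plain Python; the words and numbers are hypothetical and say nothing about what the trained model produces.

```python
def squared_norm(vector):
    """Sum of squared components, i.e. the squared L2 norm."""
    return sum(x * x for x in vector)

# Hypothetical 2-dimensional word vectors, for illustration only.
toy_embeddings = {
    "the": [0.1, -0.1],
    "terrible": [1.5, -2.0],
    "superb": [-2.0, 1.0],
}
norms = {word: squared_norm(vec) for word, vec in toy_embeddings.items()}
ranked = sorted(norms, key=norms.get, reverse=True)
print(ranked)  # ['terrible', 'superb', 'the']
```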

## Language Modeling

Use the code given below to upload your training, test and dev datasets if you're using Google Colab for training. 

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
import torch
import torch.nn as nn
USE_CUDA = torch.cuda.is_available()

In [None]:
import torchtext
from torchtext.vocab import Vectors

BATCH_SIZE = 32
EMBEDDING_SIZE = 650
MAX_VOCAB_SIZE = 50000
LOG_FILE = "language-model.log"

In [None]:
from torchtext import data
from torchtext import datasets
TEXT = data.Field(lower=True)
train, val, test = datasets.LanguageModelingDataset.splits(path=".", 
    train="text8.train.txt", validation="text8.dev.txt", test="text8.test.txt", text_field=TEXT)
TEXT.build_vocab(train, max_size=MAX_VOCAB_SIZE)
print("vocabulary size: {}".format(len(TEXT.vocab)))

VOCAB_SIZE = len(TEXT.vocab)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iter, val_iter, test_iter = data.BPTTIterator.splits(
    (train, val, test), batch_size=BATCH_SIZE, device= device, bptt_len=32, repeat=False)

The cell below shows how to get the text and target from a torchtext batch. The learning goal of our language model is to predict the next word from the previous words, so the target is the text shifted by one position. 

In [None]:
it = iter(train_iter)
batch = next(it)
print(" ".join([TEXT.vocab.itos[i] for i in batch.text[:,1].data]))
print(" ".join([TEXT.vocab.itos[i] for i in batch.target[:,1].data]))
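
The relationship between text and target can be sketched on a short token list in plain Python. `make_lm_pairs` is a hypothetical helper written for this illustration and ignores batching; it only shows that each target chunk is the text chunk shifted one token ahead.

```python
def make_lm_pairs(tokens, bptt_len):
    """Split a token stream into (text, target) chunks, with target shifted by one."""
    pairs = []
    for start in range(0, len(tokens) - 1, bptt_len):
        text = tokens[start:start + bptt_len]
        target = tokens[start + 1:start + 1 + bptt_len]
        # The last chunk may be short: trim text so both have the same length.
        pairs.append((text[:len(target)], target))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = make_lm_pairs(tokens, bptt_len=3)
for text, target in pairs:
    print(text, "->", target)
# ['the', 'cat', 'sat'] -> ['cat', 'sat', 'on']
# ['on', 'the'] -> ['the', 'mat']
```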

## Define the model

In [None]:
class RNNModel(nn.Module):
    """ Container module with an encoder, a recurrent module, and a decoder.
        Feel free to add more methods into this class if necessary. 
    """

    def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5, tie_weights=False):
        ''' Create the layers of your model. You will need the following layers:
            - embedding layer
            - recurrent neural network layer (LSTM, GRU)
            - linear decoding layer to map from hidden vector to the vocabulary
            - optionally, add the dropout layer. The dropout layer can be put after 
                the embedding layer or/and after the RNN layer, or other places. 
            - optionally, initialize your model parameters. Look for papers/blogs 
                online for good parameter initialization methods. 
            Please read the documentation for how to build LSTM with PyTorch. 
            https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM
        '''
        super(RNNModel, self).__init__()
        
        self.encoder = nn.Embedding(ntoken, ninp)
        self.drop = nn.Dropout(dropout)
        
        if rnn_type in ['LSTM', 'GRU']:
            self.rnn = getattr(nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout)
        else:
            try:
                nonlinearity = {'RNN_TANH': 'tanh', 'RNN_RELU': 'relu'}[rnn_type]
            except KeyError:
                raise ValueError( """An invalid option for `--model` was supplied,
                                 options are ['LSTM', 'GRU', 'RNN_TANH' or 'RNN_RELU']""")
            self.rnn = nn.RNN(ninp, nhid, nlayers, nonlinearity=nonlinearity, dropout=dropout)
        self.decoder = nn.Linear(nhid, ntoken)

        # Optionally tie weights as in:
        # "Using the Output Embedding to Improve Language Models" (Press & Wolf 2016)
        # https://arxiv.org/abs/1608.05859
        # and
        # "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling" (Inan et al. 2016)
        # https://arxiv.org/abs/1611.01462
        if tie_weights:
            if nhid != ninp:
                raise ValueError('When using the tied flag, nhid must be equal to emsize')
            self.decoder.weight = self.encoder.weight

        self.init_weights()

        self.rnn_type = rnn_type
        self.nhid = nhid
        self.nlayers = nlayers

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, input, hidden):
        ''' The forward layer. You will need to do the following:
            - embed word index to word vectors
            - run RNN
            - linear layer to decode hidden vectors to output words
        '''
        
        input = self.encoder(input) # encode
        emb = self.drop(input) # apply dropout to the embeddings
        out, hidden = self.rnn(emb, hidden) # get hidden vectors
        out = self.drop(out) # drop out layer

        out = self.decoder(out) # decode the output words

        return out, hidden


    def init_hidden(self, bsz, requires_grad=True):
        weight = next(self.parameters())
        if self.rnn_type == 'LSTM':
            return (weight.new_zeros((self.nlayers, bsz, self.nhid), requires_grad=requires_grad),
                    weight.new_zeros((self.nlayers, bsz, self.nhid), requires_grad=requires_grad))
        else:
            return weight.new_zeros((self.nlayers, bsz, self.nhid), requires_grad=requires_grad)

Implement the evaluation script, return the loss. 

In [None]:
def evaluate(model, data):
    model.eval()
    total_loss = 0.
    it = iter(data)
    total_count = 0.
    with torch.no_grad():
        ''' Fill in your evaluation code here. The evaluation follows the same logic as training. 
        You might want to finish the training part first. 
        '''
    
        
        hidden = model.init_hidden(BATCH_SIZE, requires_grad=False)
        for i, batch in enumerate(it):
            data, target = batch.text, batch.target
            if USE_CUDA:
                data, target = data.cuda(), target.cuda()
            hidden = repackage_hidden(hidden)
            with torch.no_grad():
                output, hidden = model(data, hidden)
            loss = loss_fn(output.view(-1, VOCAB_SIZE), target.view(-1))
            total_count += np.multiply(*data.size())
            total_loss += loss.item()*np.multiply(*data.size())
            
    loss = total_loss / total_count
    model.train()
    return loss


In [None]:
import warnings
warnings.filterwarnings("ignore")

import copy
GRAD_CLIP = 1.
NUM_EPOCHS = 2

# Helper used by both training and evaluation to truncate backpropagation between batches
def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)

model = RNNModel("GRU", VOCAB_SIZE, EMBEDDING_SIZE, EMBEDDING_SIZE, 2, dropout=0.5)
if USE_CUDA:
    model = model.cuda()
#Use cross entropy as your loss
loss_fn = torch.nn.CrossEntropyLoss() 
learning_rate = 0.001
#Use Adam as the optimizer, starting from the learning rate above
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
val_losses = []
for epoch in range(NUM_EPOCHS):
    model.train()
    it = iter(train_iter)
    hidden = model.init_hidden(BATCH_SIZE)
    for i, batch in enumerate(it):
        ''' The training code. You need to do the following:
            - prepare the training tensors, including the tensors of the history words and the predicted words
            - zero the model gradients
            - predict the next words using the model
            - compute the cross entropy loss
            - do back propagation 
            - clip the gradients as we are training RNN
            - update the model with the gradients
            - optionally, print out the batch loss after every 1000 iterations. 
            - optionally, evaluate your model on DEV after every 10000 iterations and save it to best_model. 
        '''
  
        
        data, target = batch.text, batch.target
        if USE_CUDA:
            data, target = data.cuda(), target.cuda()

        model.zero_grad()

        out, hidden = model(data, hidden)
        hidden = repackage_hidden(hidden)
        loss = loss_fn(out.view(-1, VOCAB_SIZE), target.view(-1))
        loss.backward()


        # apply gradient clipping to prevent the exploding gradient problem in RNN
        torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
        
        optimizer.step() # update the model


        if i % 1000 == 0:
            print("epoch", epoch, "iter", i, "loss", loss.item())
    
        if i % 10000 == 0:
            val_loss = evaluate(model, val_iter)
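            # 2**loss is the perplexity when the loss is measured in bits; if evaluate()
            # returns CrossEntropyLoss (natural log) unchanged, exp(val_loss) is the usual form.
            perplexity = 2**val_loss 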
            with open(LOG_FILE, "a") as fout:
                print(f"epoch: {epoch}, iteration: {i}, perplexity: {perplexity}")
                fout.write(f"epoch: {epoch}, iteration: {i}, perplexity: {perplexity}\n"  )
                
            if len(val_losses) == 0 or val_loss < min(val_losses):
                print("best model, val loss: ", val_loss)
                
                # The following may not work on Colab.  Adapt it based on
                # https://discuss.pytorch.org/t/deep-copying-pytorch-modules/13514
                best_model = copy.deepcopy(model)
                
                with open("lm-best.th", "wb") as fout:
                    torch.save(best_model.state_dict(), fout)
            else:
                learning_rate /= 4.
                optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
            val_losses.append(val_loss)
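
The clipping step in the training loop rescales all gradients together when their combined L2 norm exceeds `GRAD_CLIP`. The sketch below illustrates that rule in plain Python on a flat list of numbers. `clip_by_global_norm` and its `eps` argument are illustrative names that do not appear in the notebook, and the formula follows the documented behavior of `torch.nn.utils.clip_grad_norm_` rather than its actual source.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a flat list of gradient values so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale factor never exceeds 1, so small gradients pass through unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


# A gradient of norm 5 is shrunk to (almost exactly) norm 1.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length changes.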

In [None]:
val_loss = evaluate(best_model, val_iter)
print("perplexity: ", 2**val_loss)

#### Use the best model to evaluate the test dataset. 

In [None]:
test_loss = evaluate(best_model, test_iter)
print("perplexity: ", 2**test_loss )

Generate some sentences. 

In [None]:
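# Sample 100 words, feeding each sampled word back in as the next input
best_model.eval()  # disable dropout during generation
hidden = best_model.init_hidden(1)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
input = torch.randint(VOCAB_SIZE, (1, 1), dtype=torch.long).to(device)
words = []
with torch.no_grad():  # no gradients are needed for sampling
    for i in range(100):
        output, hidden = best_model(input, hidden)
        # exp() turns the scores into positive (unnormalized) sampling weights
        word_weights = output.squeeze().exp().cpu()
        word_idx = torch.multinomial(word_weights, 1)[0]
        input.fill_(word_idx)
        word = TEXT.vocab.itos[word_idx]
        words.append(word)
print(" ".join(words))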

In [None]:
# Greedy Decoding (single step): run the model once on a random start word and
# list the 100 highest-scoring next words. The choices are not fed back in.
best_model.eval()  # disable dropout during generation
hidden = best_model.init_hidden(1)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
input = torch.randint(VOCAB_SIZE, (1, 1), dtype=torch.long).to(device)
words = []

with torch.no_grad():
    output, hidden = best_model(input, hidden)
word_weights = output.squeeze().exp().cpu()
val, word_idx = torch.topk(input=word_weights, k=100)

for idx in word_idx:
    word = TEXT.vocab.itos[idx]
    words.append(word)
print(" ".join(words))



# Best Results

## Using 5 epochs


```
epoch 0 iter 0 loss 10.82283878326416
epoch: 0, iteration: 0, perplexity: 1757.6632350478521
best model, val loss:  10.779442963978829
epoch 0 iter 1000 loss 6.506584167480469
epoch 0 iter 2000 loss 6.348207473754883
epoch 0 iter 3000 loss 6.102053642272949
epoch 0 iter 4000 loss 5.419452667236328
epoch 0 iter 5000 loss 5.919216156005859
epoch 0 iter 6000 loss 5.8455810546875
epoch 0 iter 7000 loss 5.6072678565979
epoch 0 iter 8000 loss 5.785408020019531
epoch 0 iter 9000 loss 5.4222731590271
epoch 0 iter 10000 loss 5.598560333251953
epoch: 0, iteration: 10000, perplexity: 38.53165289463282
best model, val loss:  5.267972169588238
epoch 0 iter 11000 loss 5.6927289962768555
epoch 0 iter 12000 loss 5.678534984588623
epoch 0 iter 13000 loss 5.372802734375
epoch 0 iter 14000 loss 5.291945934295654
epoch 1 iter 0 loss 5.660902976989746
epoch: 1, iteration: 0, perplexity: 34.76176337103176
best model, val loss:  5.11942936294636
epoch 1 iter 1000 loss 5.579431533813477
epoch 1 iter 2000 loss 5.623998641967773
epoch 1 iter 3000 loss 5.434945106506348
epoch 1 iter 4000 loss 4.951798915863037
epoch 1 iter 5000 loss 5.430264472961426
epoch 1 iter 6000 loss 5.405932426452637
epoch 1 iter 7000 loss 5.3224263191223145
epoch 1 iter 8000 loss 5.420109748840332
epoch 1 iter 9000 loss 5.1500983238220215
epoch 1 iter 10000 loss 5.285759449005127
epoch: 1, iteration: 10000, perplexity: 30.813975719112218
best model, val loss:  4.945512930468414
epoch 1 iter 11000 loss 5.3877716064453125
epoch 1 iter 12000 loss 5.361586093902588
epoch 1 iter 13000 loss 5.165499687194824
epoch 1 iter 14000 loss 5.0832414627075195
epoch 2 iter 0 loss 5.469143390655518
epoch: 2, iteration: 0, perplexity: 29.996484974268075
best model, val loss:  4.90672154869852
epoch 2 iter 1000 loss 5.311384677886963
epoch 2 iter 2000 loss 5.427046775817871
epoch 2 iter 3000 loss 5.298762321472168
epoch 2 iter 4000 loss 4.734654426574707
epoch 2 iter 5000 loss 5.279272079467773
epoch 2 iter 6000 loss 5.2258148193359375
epoch 2 iter 7000 loss 5.187764644622803
epoch 2 iter 8000 loss 5.244477272033691
epoch 2 iter 9000 loss 5.050817012786865
epoch 2 iter 10000 loss 5.070919036865234
epoch: 2, iteration: 10000, perplexity: 28.432046544494217
best model, val loss:  4.829446043122923
epoch 2 iter 11000 loss 5.249824047088623
epoch 2 iter 12000 loss 5.231677532196045
epoch 2 iter 13000 loss 5.055905342102051
epoch 2 iter 14000 loss 5.032273292541504
epoch 3 iter 0 loss 5.37913179397583
epoch: 3, iteration: 0, perplexity: 28.1427962437409
best model, val loss:  4.814693775507217
epoch 3 iter 1000 loss 5.192203998565674
epoch 3 iter 2000 loss 5.328357696533203
epoch 3 iter 3000 loss 5.265546798706055
epoch 3 iter 4000 loss 4.672381401062012
epoch 3 iter 5000 loss 5.262495994567871
epoch 3 iter 6000 loss 5.129266262054443
epoch 3 iter 7000 loss 5.055858135223389
epoch 3 iter 8000 loss 5.183602333068848
epoch 3 iter 9000 loss 4.950128555297852
epoch 3 iter 10000 loss 5.034492015838623
epoch: 3, iteration: 10000, perplexity: 27.25127013620251
best model, val loss:  4.768251567951126
epoch 3 iter 11000 loss 5.181535720825195
epoch 3 iter 12000 loss 5.18128776550293
epoch 3 iter 13000 loss 5.033013343811035
epoch 3 iter 14000 loss 4.940471172332764
```


## Results

Best validation (DEV) perplexity:  27.25127013620251

Test perplexity:  32.441465873566145
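
The perplexities above were computed as `2**loss`. That matches the usual definition only when the loss is measured in bits. `torch.nn.CrossEntropyLoss` uses the natural logarithm, so if `evaluate` returns the mean of that loss unchanged (an assumption, since `evaluate` is defined elsewhere in the notebook), the conventional perplexity would be `exp(loss)`. The helper names in this sketch are illustrative only.

```python
import math


def perplexity_from_nats(loss_nats):
    """Perplexity for a mean cross-entropy measured with the natural log."""
    return math.exp(loss_nats)


def perplexity_from_bits(loss_bits):
    """Perplexity for a mean cross-entropy measured with log base 2."""
    return 2.0 ** loss_bits


def nats_to_bits(loss_nats):
    """Convert a loss in nats to the same loss in bits."""
    return loss_nats / math.log(2)


# Last "best model" validation loss from the training log above.
val_loss = 4.768251567951126
as_logged = perplexity_from_bits(val_loss)      # treats the value as bits
if_in_nats = perplexity_from_nats(val_loss)     # treats the value as nats
```

Both conventions agree once the units are consistent: converting a loss from nats to bits and then raising 2 to that power gives the same number as `exp` of the original loss.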

## Sentences

### Multinomial

s jaws was regularly grounded for scavengers on the prevention of amber its transfer head that had to bite buster his steam orbiter to cure anyone was torso until danger of and stroke dragged from my mouth to practice <unk> remarked after the poison cared for the sketch of the sea alters their health of struggle that would return at a battle the short attempt for connally client robert common earlier evidence on the voyage of miss lamia may have never had a number of swords than even before balloons or behave less would isolate this object effects of the


bergman walpole college of oz costly kin <unk> <unk> american football team name play for translations page knee pda progress called a good permission that applies stolen by sixty have of typing and quality reference australian history several videos from a series entitled development mailing list readings but managed for society external links entry and <unk> sites to help adobe <unk> a for discovery includes malaria coverage com int beta ratio segregated of drinking statistical travel and knowledge of carbon dioxide energy energy element <unk> john walton dk from unintended life to disease with a spotlight at home each of

pronounced the <unk> k which covers the first punic forces because lunar saharan orders visited almost all the <unk> became the world transportation act it can fallen into the united states polish architecture the element of the structures w jorge wilmington was patented on july two six one nine five eight as well as the two zero zero five nobel prize winners encounter he asserts that there has failed to settle off of surface contained and gaining success from the investments so he is stupid for its use at the national sit in princeton metropolitan new york city famous film
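
The samples above come from drawing each next word at random, with probability proportional to the exponential of its score. A minimal plain-Python sketch of that one sampling step follows. The `sample_index` helper, the toy vocabulary, and the scores are hypothetical stand-ins for the model output and `TEXT.vocab.itos`.

```python
import math
import random


def sample_index(scores, rng):
    """Draw one index with probability proportional to exp(score)."""
    weights = [math.exp(s) for s in scores]  # positive, unnormalized weights
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# Hypothetical vocabulary and scores for a single decoding step.
toy_vocab = ["the", "cat", "sat"]
toy_scores = [1.0, 0.5, -2.0]

rng = random.Random(0)  # seeded so the run is repeatable
sampled_word = toy_vocab[sample_index(toy_scores, rng)]
```

The weights do not need to sum to one, which is why the notebook can pass the exponentiated scores straight to `torch.multinomial`.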

### Greedy Decoder

<unk> the and of in one s a is for on as two that which to also was or are by with three it who i other this at from an e but see have has four de can where five such were so external will m zero eight most including six first seven often about while more may nine all however called some new now used like based these g main not he no although many another do there v university d la they image high o r com sometimes below thus his ii found later state references p


and conference university encyclopedia period in law fiction <unk> community the assembly of order era edition center council one series time is article review was convention s system society congress institute tradition park league history college book school events text view wars sites movement on crisis meeting model library site or day equation as description rule definition research with version project organization discussion records revolution accounts awards age church agreement see but entry interpretation reports hall campaign style map process cycle foundation analysis studies states film account office work vision opinion comics writers historians atlas calendar issues institution academy works

microsystems iv <unk> s devils iii ii i and brown smith yat jones bonds one phi v hunt lewis o ndez viper lia todd the adams tley e two vi loves byrd island g jackson h clark bonaparte massachusetts genesis dioxide theta rogers wyatt bay richards green coffey shall brien roses hare robertson stewart jenkins clarke rays x mandela klan alpha bond shirt ashby springs anderson xi fairbanks wallace hamilton cells wolf else walks viii edwards monroe tree xii roberts is beta doyle johnson goes austen rouge w pigs grey atoms collins pi van epsilon lane fraternity mary blake t
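
Each list above holds the 100 highest-scoring next words after a single model step. Step-by-step greedy decoding differs: it picks the single best word, feeds it back as the next input, and repeats. The sketch below shows that loop in plain Python. `greedy_decode`, `toy_step`, and the score table are hypothetical stand-ins for the notebook's model and are not taken from it.

```python
def greedy_decode(step_fn, start_token, hidden, max_len):
    """Repeatedly pick the highest-scoring next token and feed it back in."""
    tokens = []
    token = start_token
    for _ in range(max_len):
        scores, hidden = step_fn(token, hidden)
        # argmax over the score list
        token = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(token)
    return tokens


# Hypothetical "model": row i holds the next-token scores after token i.
TOY_SCORES = [
    [0.1, 2.0, 0.3],
    [0.2, 0.1, 1.5],
    [3.0, 0.0, 0.5],
]


def toy_step(token, hidden):
    """Return next-token scores and pass the (unused) hidden state through."""
    return TOY_SCORES[token], hidden


decoded = greedy_decode(toy_step, start_token=0, hidden=None, max_len=4)
```

Greedy decoding is deterministic for a given start word, so it tends to produce more repetitive text than the multinomial sampling used earlier.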


# Experimentation

Increasing the number of epochs reduced the perplexity. The code above sets NUM_EPOCHS = 2, a value chosen after a few trials to keep training time manageable.

Using a Gated Recurrent Unit ("GRU") was also an improvement.