# Project 3: Emotion detection with Neural Networks
## CS4740/5740 Fall 2020

Names:

Netids:

### Project Submission Due: November 13th
Please submit a **PDF file** of this notebook on **Gradescope**, and the **ipynb** file on **CMS**. For instructions on generating the PDF and ipynb files, please refer to the Project 1 instructions.



## Introduction
In this project we will consider **neural networks**: first a Feedforward Neural Network (FFNN) and second a Recurrent Neural Network (RNN), for performing a 5-class emotion detection task.

The project is divided into parts. In **Part 1**, you will be given an implementation for a FFNN and be asked to debug it in a specific way. In **Part 2**, you will then implement an RNN model for performing the same task. In **Part 3**, you will analyze these two models in two types of comparative studies and in **Part 4** you will answer questions describing what you have learned through this project. You also will be required to submit a description of libraries used, how your group divided up the work, and your feedback regarding the assignment (**Part 5**).

## Advice 🚀
As always, the report is important! The report is where you get to show that you understand not only what you are doing but also why and how you are doing it. So be clear, organized and concise; avoid vagueness and excess verbiage. Spend time doing error analysis for the models. This is how you understand the advantages and drawbacks of the systems you build. The reports should read more like the papers that we have been writing critiques for.

All throughout the report you may be asked to place images, plots, etc. Feel free to write code that will generate the plots for you and use those or generate them some other way and insert into the colab. To add images in your colab, these are a few possible ways to do it!

1. Copy and paste the image in markdown! Yes this really does work

2. Upload to google drive, get a shareable link. It will be something like:

```
https://drive.google.com/file/d/1xDrydbSbijvK2JBftUz-5ovagN2B_RWH/view?usp=sharing
```
We want just the id which is `1xDrydbSbijvK2JBftUz-5ovagN2B_RWH` and the link we will use is:

```
https://drive.google.com/uc?export=view&id=your_id
```

Then in markdown you'd write the following:

```markdown
![image](https://drive.google.com/uc?export=view&id=1xDrydbSbijvK2JBftUz-5ovagN2B_RWH)
```

3. Using IPython!
```python
from IPython.display import Image
Image(filename="drive/GPU/data/iris.PNG")
```

4. Using your connected GDrive
```markdown
![iris](drive/GPU/data/iris.PNG)
```
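The link rewriting in option 2 can also be done programmatically. The snippet below is a minimal sketch; the helper name `drive_view_url` is hypothetical, and it assumes the share link has the `/file/d/<id>/` shape shown above.

```python
import re


def drive_view_url(share_url: str) -> str:
    """Turn a Drive share link into the direct-view link described above."""
    # The file id is the path segment that follows /file/d/
    match = re.search(r"/file/d/([^/]+)", share_url)
    if match is None:
        raise ValueError("no file id found in URL")
    return f"https://drive.google.com/uc?export=view&id={match.group(1)}"


share = "https://drive.google.com/file/d/1xDrydbSbijvK2JBftUz-5ovagN2B_RWH/view?usp=sharing"
direct = drive_view_url(share)
print(f"![image]({direct})")
```

The printed line can be pasted directly into a markdown cell.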

## Dataset
You are given access to a set of tweets. Each tweet has an associated emotion $y \in Y := \{anger, fear, joy, love, sadness\}$. For this project, given the tweet text, you will need to predict the associated emotion, $y$. This is sometimes called fine-grained sentiment analysis in the literature; we will simply refer to it as sentiment analysis in this project.

We will minimally preprocess the tweets and handle tokenization in what we release. For this assignment, we do not anticipate any further preprocessing to be done by you. Should you choose to do so, it would be interesting to hear about in the report (along with whether or not it helped performance), but it is not a required aspect of the assignment.


In [None]:
from google.colab import drive
import os
drive.mount('/content/drive', force_remount=True)

train_path = os.path.join(os.getcwd(), "drive", "My Drive", "p3", "p3_train.txt") # replace based on your Google drive organization
val_path = os.path.join(os.getcwd(), "drive", "My Drive",  "p3", "p3_val.txt") # replace based on your Google drive organization
test_path = os.path.join(os.getcwd(), "drive", "My Drive",  "p3", "p3_test_no_labels.txt") # replace based on your Google drive organization

Mounted at /content/drive


# Part 1: Feedforward Neural Network

In this section, there are two main components relevant to **Part 1**.

1. `Data loader`\
As the name suggests, this section loads the data from the dataset files and handles other preprocessing and setup. You will **not** need to change this file and should **not** change this file throughout the assignment.

2. `ffnn`\
This contains the model and code that uses the model for **Part 1**

In the `ffnn` section, you will find a Feedforward Neural Net serving as the underlying model for performing emotion detection.



## Part 1: Tips

We do not assume you have **any** experience working with neural networks and/or debugging them. You may discover that this process, while similar, is quite different from debugging in general software engineering and from debugging in other domains such as algorithms and systems.

We suggest you systematically step through the code and simultaneously (perhaps by physically drawing it out) describe what the computations _mean_. What you are looking for is where the code differs from what is expected.
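One way to step through a network systematically is to record the tensor shape entering and leaving each layer and compare it with what you drew on paper. The sketch below uses a hypothetical toy network (`toy`) that is unrelated to the assignment's model; `register_forward_hook` is a standard `nn.Module` method.

```python
import torch
import torch.nn as nn

# Hypothetical toy network, not the assignment's FFNN
toy = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 3), nn.LogSoftmax(dim=1))

shapes = []


def record_shape(module, inputs, output):
    # inputs is a tuple of positional arguments; the first one is the tensor
    shapes.append((module.__class__.__name__, tuple(inputs[0].shape), tuple(output.shape)))


hooks = [layer.register_forward_hook(record_shape) for layer in toy]
batch = torch.randn(2, 8)
out = toy(batch)
for hook in hooks:
    hook.remove()  # detach the hooks once the trace is recorded

for name, in_shape, out_shape in shapes:
    print(f"{name}: {in_shape} -> {out_shape}")
```

Any step whose shape or meaning differs from your drawing is a good place to look more closely.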

## Part 1: Rules

For **Part 1**, you will not be able to ask any questions on Piazza and we will be unable to provide any meaningful advice in office hours. Unfortunately, this is the nature of debugging: it is unlikely anyone can give you specific advice for most problems you encounter, and we have already provided general tips in the preceding section. If you absolutely must ask a question or you believe there is some kind of issue with the assignment for this part, please submit a private Piazza post and we will respond swiftly.

As a reminder, **communication about the assignment _between_ distinct groups is not permitted and is a violation of the Academic Integrity policy**. For this assignment, we will be _extremely_ stringent about this, given that debugging is entirely pointless if someone else in a different group tells you where the error is.

## Import libraries and connect to Google Drive

In [1]:
import json
import math
import os
from pathlib import Path
import random
import time
from tqdm.notebook import tqdm, trange
from typing import Dict, List, Set, Tuple
import torch.nn.functional as F
import numpy as np
import torch
import torch.nn as nn
from torch.nn import init
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler

## Data loader

In [None]:
emotion_to_idx = {
    "anger": 0,
    "fear": 1,
    "joy": 2,
    "love": 3,
    "sadness": 4,
    "surprise": 5,
}
idx_to_emotion = {v: k for k, v in emotion_to_idx.items()}
UNK = "<UNK>"
PAD = "<PAD>"

In [None]:
def fetch_data(train_data_path, val_data_path, test_data_path):
    """fetch_data retrieves the data from a json/csv and outputs the validation
    and training data

    :param train_data_path:
    :type train_data_path: str
    :return: Training, validation pair where the training is a list of document, label pairs
    :rtype: Tuple[
        List[Tuple[List[str], int]],
        List[Tuple[List[str], int]],
        List[List[str]]
    ]
    """
    with open(train_data_path) as training_f:
        training = training_f.read().split("\n")[1:-1]
    with open(val_data_path) as valid_f:
        validation = valid_f.read().split("\n")[1:-1]
    with open(test_data_path) as testing_f:
        testing = testing_f.read().split("\n")[1:-1]
	
    # If needed, you can shrink the training and validation data to speed things up by setting k < 10000, but this isn't always safe to do
    # k = #fill in
    # random.shuffle(training)  # shuffles in place and returns None, so do not assign the result
    # random.shuffle(validation)
    # training, validation = training[:k], validation[:(k // 10)]

    tra = []
    val = []
    test = []
    for elt in training:
        if elt == '':
            continue
        txt, emotion = elt.split(",")
        tra.append((txt.split(" "), emotion_to_idx[emotion]))
    for elt in validation:
        if elt == '':
            continue
        txt, emotion = elt.split(",")
        val.append((txt.split(" "), emotion_to_idx[emotion]))
    for elt in testing:
        if elt == '':
            continue
        txt = elt
        test.append(txt.split(" "))

    return tra, val, test

In [None]:
def make_vocab(data):
		"""make_vocab creates a set of vocab words that the model knows

		:param data: The list of documents that is used to make the vocabulary
		:type data: List[str]
		:returns: A set of strings corresponding to the vocabulary
		:rtype: Set[str]
		"""
		vocab = set()
		for document, _ in data:
				for word in document:
						vocab.add(word)
		return vocab 


def make_indices(vocab):
	"""make_indices creates a 1-1 mapping of word and indices for a vocab.

	:param vocab: The strings corresponding to the vocabulary in train data.
	:type vocab: Set[str]
	:returns: A tuple containing the vocab, word2index, and index2word.
		vocab is a set of strings in the vocabulary including <UNK>.
		word2index is a dictionary mapping tokens to its index (0, ..., V-1)
		index2word is a dictionary inverting the mapping of word2index
	:rtype: Tuple[
		Set[str],
		Dict[str, int],
		Dict[int, str],
	]
	"""
	vocab_list = sorted(vocab)
	vocab_list.append(UNK)
	
	word2index = {}
	index2word = {}
	for index, word in enumerate(vocab_list):
		word2index[word] = index 
		index2word[index] = word 
	vocab.add(UNK)
	return vocab, word2index, index2word 


def convert_to_vector_representation(data, word2index, test=False):
	"""convert_to_vector_representation converts the list of strings into a vector

	:param data: The dataset to be converted into a vectorized format
	:type data: Union[
		List[Tuple[List[str], int]],
		List[str],
	]
	:param word2index: A mapping of word to index
	:type word2index: Dict[str, int]
	:returns: A list of vector representations of the input or pairs of vector
		representations with expected output
	:rtype: List[Tuple[torch.Tensor, int]] or List[torch.Tensor]

	List[Tuple[List[torch.Tensor], int]] or List[List[torch.Tensor]]
	"""
	if test:
		vectorized_data = []
		for document in data:
			vector = torch.zeros(len(word2index)) 
			for word in document:
				index = word2index.get(word, word2index[UNK])
				vector[index] += 1
			vectorized_data.append(vector)
	else:
		vectorized_data = []
		for document, y in data:
			vector = torch.zeros(len(word2index)) 
			for word in document:
				index = word2index.get(word, word2index[UNK])
				vector[index] += 1
			vectorized_data.append((vector, y))
	return vectorized_data
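To make the representation built by `convert_to_vector_representation` concrete, here is a small worked example. The vocabulary `toy_word2index` is hypothetical; the real mapping comes from `make_indices`.

```python
import torch

# Hypothetical toy vocabulary; the real word2index comes from make_indices
toy_word2index = {"happy": 0, "i": 1, "so": 2, "<UNK>": 3}
document = ["i", "am", "so", "so", "happy"]

vector = torch.zeros(len(toy_word2index))
for word in document:
    # Unknown words ("am" here) fall back to the <UNK> slot
    vector[toy_word2index.get(word, toy_word2index["<UNK>"])] += 1

print(vector)  # tensor([1., 1., 2., 1.])
```

Each position holds the count of one vocabulary word, so word order is discarded and every document maps to a vector of the same length.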

In [None]:
class EmotionDataset(Dataset):
    """EmotionDataset is a torch dataset to interact with the emotion data.

    :param data: The vectorized dataset with input and expected output values
    :type data: List[Tuple[List[torch.Tensor], int]]
    """
    def __init__(self, data):
        self.X = torch.cat([X.unsqueeze(0) for X, _ in data])
        self.y = torch.LongTensor([y for _, y in data])
        self.len = len(data)
    
    def __len__(self):
        """__len__ returns the number of samples in the dataset.

        :returns: number of samples in dataset
        :rtype: int
        """
        return self.len
    
    def __getitem__(self, index):
        """__getitem__ returns the tensor, output pair for a given index

        :param index: index within dataset to return
        :type index: int
        :returns: A tuple (x, y) where x is model input and y is our label
        :rtype: Tuple[torch.Tensor, int]
        """
        return self.X[index], self.y[index]



def get_data_loaders(train, val, batch_size=16):
    """
    """
    # First we create the dataset given our train and validation lists
    dataset = EmotionDataset(train + val)

    # Then, we create a list of indices for all samples in the dataset
    train_indices = [i for i in range(len(train))]
    val_indices = [i for i in range(len(train), len(train) + len(val))]

    # Now we define samplers and loaders for train and val
    train_sampler = SubsetRandomSampler(train_indices)
    train_loader = DataLoader(dataset, batch_size=batch_size, sampler=train_sampler)
    
    val_sampler = SubsetRandomSampler(val_indices)
    val_loader = DataLoader(dataset, batch_size=batch_size, sampler=val_sampler)

    return train_loader, val_loader


In [None]:
train, val, test = fetch_data(train_path, val_path, test_path)

In [None]:
vocab = make_vocab(train)
vocab, word2index, index2word = make_indices(vocab)
train_vectorized = convert_to_vector_representation(train, word2index)
val_vectorized = convert_to_vector_representation(val, word2index)
test_vectorized = convert_to_vector_representation(test, word2index, True)

In [None]:
train_loader, val_loader = get_data_loaders(train_vectorized, val_vectorized,batch_size = 16)

In [None]:
# Note: Colab has 12 hour limits on GPUs, also potential inactivity may kill the notebook. Save often!

## 1.1 FFNN Implementation

### 1.1 Task
Assume that an omniscient oracle has told you there are **4 fundamental errors** in the **FFNN** implementation. They may be anywhere in this section unless otherwise indicated. Your objective is to _find_ and _fix_ each of these errors and to include in the report a description of the original error along with the fix. To help your efforts, the oracle has provided you with additional information about the properties of the errors as follows:

* _Correctness_ \
Each error causes the code to be strictly incorrect. There is absolutely no ambiguity that the errant code (or missing code) is incorrect. This means errors are not due to the code being inefficient (in run-time or in memory).

* _Localized_ \
Each error can be judged to be erroneous by strictly looking at the code (along with your knowledge of machine learning as taught through this course). The errors therefore are not due to the model being uncompetitive in terms of performance with state-of-the-art performance for this task nor are they due to the amount of data being insufficient for this task in general.

* _General_ \
Each error is general in nature. They will not be triggered by the model receiving a pathological input, i.e. they will not be something that is triggered specifically when NLP is referenced with negative sentiment.

* _Fundamental_ \
Each error is a fundamental failure in terms of doing what is intended. This means that errors do not hinge on nuanced understanding of specific PyTorch functionality. This also means they will not exploit properties of the dataset in a subtle way that could only be realized by someone who has comprehensively studied the data.

The bottom line: the errors should be fairly obvious. The oracle further reminds you that performance/accuracy of the (resulting) model should not be how you ensure you have debugged successfully. For example, if you correct some, but not all, of the errors, the remaining errors may mask the impact of your fixes. Further, performance is not guaranteed to improve by fixing any particular error. Consider the case where the training set is also employed as the test set; performance will be very high but there is something very wrong. And fixing the problem will reduce performance.

In fixing each error, the oracle provides some further insight about the fixes:

* _Minimal_ \
A reasonable fix for each error can be achieved by changing fewer than 5 lines of code. We do not require you to make fixes of 4 or fewer lines, but it should be a cause for concern if your fixes are far more elaborate.

* _Ill-posed_ \
While the errors are unambiguous, the method for fixing them is under-specified: You are free to implement any reasonable fix and all such fixes will equally receive full credit.
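Since accuracy alone is not a reliable signal, it can help to check properties that must hold regardless of performance. The sketch below illustrates two such checks on a hypothetical stand-in model (`toy`); it is not the assignment's model and is not meant to point at any particular error.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
# Hypothetical stand-in model; substitute the model under test
toy = nn.Sequential(nn.Linear(10, 6), nn.LogSoftmax(dim=1))
optimizer = optim.SGD(toy.parameters(), lr=0.1)
x = torch.randn(4, 10)
y = torch.tensor([0, 1, 2, 3])

# Check 1: log-probabilities should exponentiate to a distribution per row
log_probs = toy(x)
row_sums = log_probs.exp().sum(dim=1)
print(row_sums)

# Check 2: a single optimizer step should actually move the weights
before = toy[0].weight.detach().clone()
optimizer.zero_grad()
loss = nn.NLLLoss()(log_probs, y)
loss.backward()
optimizer.step()
weights_changed = not torch.equal(before, toy[0].weight.detach())
print(weights_changed)
```

Checks like these verify what the code is supposed to compute, which is a different question from how well the model scores.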

In [None]:
# Lambda to switch to GPU if available
get_device = lambda : "cuda:0" if torch.cuda.is_available() else "cpu"

In [None]:
unk = '<UNK>'

# Consult the PyTorch documentation for information on the functions used below:
# https://pytorch.org/docs/stable/torch.html

class FFNN(nn.Module):
  def __init__(self, input_dim, h, output_dim):
    super(FFNN, self).__init__()
    self.h = h
    self.W1 = nn.Linear(input_dim, h)
    self.activation = nn.ReLU() # The rectified linear unit; one valid choice of activation function
    self.W2 = nn.Linear(h, output_dim)
    # The below two lines are not a source for an error
    self.softmax = nn.LogSoftmax(dim=1) # The softmax function that converts vectors into probability distributions; computes log probabilities for computational benefits
    self.loss = nn.NLLLoss() # The cross-entropy/negative log likelihood loss taught in class

  def compute_Loss(self, predicted_vector, gold_label):
    return self.loss(predicted_vector, gold_label)

  def forward(self, input_vector):
    # The z_i are just there to record intermediary computations for your clarity
    z1 = self.W1(input_vector)
    relu = self.activation(z1)
    z2 = self.W2(relu)
    predicted_vector = self.softmax(z2)
    return predicted_vector
  
  def load_model(self, save_path):
    self.load_state_dict(torch.load(save_path))
  
  def save_model(self, save_path):
    torch.save(self.state_dict(), save_path)


def train_epoch(model, train_loader, optimizer):
  model.train()
  total = 0
  loss = 0
  correct = 0
  for (input_batch, expected_out) in tqdm(train_loader, leave=False, desc="Training Batches"):
    output = model(input_batch.to(get_device()))
    #print(output)
    total += output.size()[0]
    _, predicted = torch.max(output, 1)
    correct += (expected_out == predicted.to("cpu")).cpu().numpy().sum()

    loss = model.compute_Loss(output, expected_out.to(get_device()))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(correct)
  # Print accuracy

  return


def evaluation(model, val_loader, optimizer):
  model.eval()
  loss = 0
  correct = 0
  total = 0
  for (input_batch, expected_out) in tqdm(val_loader, leave=False, desc="Validation Batches"):
    output = model(input_batch.to(get_device()))
    total += output.size()[0]
    _, predicted = torch.max(output, 1)
    correct += (expected_out.to("cpu") == predicted.to("cpu")).cpu().numpy().sum()

    loss += model.compute_Loss(output, expected_out.to(get_device()))
  loss /= len(val_loader)
  # Print validation metrics

  pass

def train_and_evaluate(number_of_epochs, model, train_loader, val_loader):
  optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
  for epoch in trange(number_of_epochs, desc="Epochs"):
    train_epoch(model, train_loader, optimizer)
    evaluation(model, val_loader, optimizer)
  return

In [None]:
h = 512
model = FFNN(len(vocab), h, len(emotion_to_idx)).to(get_device())
train_and_evaluate(2, model, train_loader, val_loader)
model.save_model("ffnn_fixed.pth") # Save our model!

In [None]:
print(model)

FFNN(
  (W1): Linear(in_features=11832, out_features=512, bias=True)
  (activation): ReLU()
  (W2): Linear(in_features=512, out_features=512, bias=True)
  (softmax): LogSoftmax(dim=1)
  (loss): NLLLoss()
)


In [None]:
# Example of how to load
loaded_model = FFNN(len(vocab), h, len(emotion_to_idx))
loaded_model.load_model("ffnn_fixed.pth")
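The save/load pattern above stores only the parameters (the `state_dict`), so the model object must be rebuilt with the same dimensions before loading. A minimal round trip, using a hypothetical small module and a temporary path, might look like this:

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical small module standing in for the assignment's model
original = nn.Linear(4, 2)
path = os.path.join(tempfile.mkdtemp(), "toy.pth")
torch.save(original.state_dict(), path)

restored = nn.Linear(4, 2)  # must be built with the same dimensions
restored.load_state_dict(torch.load(path))
same = all(
    torch.equal(a, b) for a, b in zip(original.parameters(), restored.parameters())
)
print(same)
```

If a checkpoint was saved on a GPU and is loaded on a CPU-only machine, `torch.load` accepts a `map_location` argument (for example `map_location="cpu"`).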

## 1.2 Part 1 Report
Please include a description of the error, a description of your fix, and a python comment indicating the fix for each of the 4 errors.

### Error 1:
Your answer here.

The ReLU activation was not applied in the `forward` function, so the two linear layers collapsed into a single linear map.

### Error 2:
Your answer here.

### Error 3:
Your answer here.

### Error 4:
Your answer here.

# Part 2: Recurrent Neural Network
Recurrent neural networks have been the workhorse of NLP for a number of years. A fundamental reason for this success is they can inherently deal with _variable_ length sequences. This is axiomatically important for natural language; words are formed from a variable number of characters, sentences from a variable number of words, paragraphs from a variable number of sentences, and so forth. This differs from a field like Computer Vision where images are (generally) of a fixed size.
<br></br>
This is also a very different scenario from that of the classifiers we have studied (e.g. Naive Bayes, Perceptron Learning, Feedforward Neural Networks), which take in a fixed-length vector.
<br></br>
To clarify this, we can think of the _types_ of the mathematical functions described by a FFNN and an RNN. What is pivotal in what follows is that $k$ need not be constant across examples.

$\textbf{FFNN.}$ \
$Input: \vec{x} \in \mathcal{R}^d$ \
$Model\text{ }Output: \vec{z} \in \mathcal{R}^{\mid \mathcal{Y}\mid}$ \
$Final\text{ }Output: \vec{y} \in \mathcal{R}^{\mid \mathcal{Y}\mid}$ \
$\vec{y}$ satisfies the constraint of being a probability distribution, i.e. $\underset{i \in \mid \mathcal{Y} \mid}{\sum} \vec{y}[i] = 1$ and $\underset{i \in \mid \mathcal{Y} \mid}{min} \text{ }\vec{y}[i] \geq 0$, which is achieved via _Softmax_ applied to $\vec{z}$.
<br></br>
$\textbf{RNN.}$ \
$Input: \vec{x}_1,\vec{x}_2, \dots, \vec{x}_k; \vec{x}_i \in \mathcal{R}^d$ \
$Model\text{ }Output: \vec{z}_1,\vec{z}_2, \dots, \vec{z}_k; \vec{z}_i \in \mathcal{R}^{h}$ \
$Final\text{ }Output: \vec{y} \in \mathcal{R}^{\mid \mathcal{Y}\mid}$ \
$\vec{y}$ satisfies the constraint of being a probability distribution, i.e. $\underset{i \in \mid \mathcal{Y} \mid}{\sum} \vec{y}[i] = 1$ and $\underset{i \in \mid \mathcal{Y} \mid}{min} \text{ }\vec{y}[i] \geq 0$, which is achieved by the process described later in this report and as you have seen in class.

Intuitively, an RNN takes in a sequence of vectors and computes a new vector corresponding to each vector in the original sequence. It achieves this by processing the input sequence one vector at a time to (a) compute an updated representation of the entire sequence (which is then re-used when processing the next vector in the input sequence), and (b) produce an output for the current position. The vector computed in (a) therefore not only contains information about the current input vector but also about the previous input vectors. Hence, $\vec{z}_j$ is computed after having observed $\vec{x}_1, \dots, \vec{x}_j$. As such, a simple observation is that we can treat the last vector computed by the RNN, i.e. $\vec{z}_k$, as a representation of the entire sequence. Accordingly, we can use this as the input to a single-layer linear classifier to compute a vector $\vec{y}$, as we will need for classification.

$$\vec{y} = Softmax(W\vec{z}_k); W\in \mathcal{R}^{\mid \mathcal{Y}\mid \times h}$$
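The types above can be checked numerically. The following is a minimal sketch with illustrative sizes (`d`, `h`, and `num_classes` are assumptions, not the assignment's values) that uses PyTorch's built-in `nn.RNN` purely to show the shapes; it is not a suggested solution for Part 2.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, h, num_classes = 5, 7, 6   # illustrative sizes, not the assignment's
rnn_layer = nn.RNN(d, h)      # expects input shaped [k, batch, d]
W = nn.Linear(h, num_classes, bias=False)

for k in (3, 9):              # two sequences of different lengths
    x = torch.randn(k, 1, d)  # x_1 ... x_k
    z, _ = rnn_layer(x)       # z_1 ... z_k, each in R^h
    y = torch.softmax(W(z[-1]), dim=1)  # y = Softmax(W z_k)
    print(k, tuple(z.shape), tuple(y.shape), float(y.sum()))
```

The same parameters handle both sequence lengths, and in each case $\vec{y}$ has $\mid \mathcal{Y} \mid$ non-negative entries that sum to 1.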

## Part 2: Rules
**Part 2** requires implementing a rudimentary RNN in PyTorch for text classification. Countless blog posts, internet tutorials and other implementations available publicly (and privately) do precisely this. In fact, almost every student in [Cornell NLP](https://nlp.cornell.edu/people/) likely has some code for doing this on their Github. You **cannot** use any such code (though you may use anything you find in course notes or course texts) irrespective of whether you cite it or do not.

Submissions will be passed through the MOSS system, which is a sophisticated system for detecting plagiarism in code and is robust in the sense that it tries to find alignments in the underlying semantics of the code and not just the surface level syntax. Similarly, the course staff are also quite astute with respect to programming neural models for NLP and we will strenuously look at your code. We flagged multiple groups for this last year, so we strongly suggest you resist any such temptation (if the Academic Integrity policy alone is insufficient at dissuading you).

## 2.1 RNN Implementation

Similar to **Part 1**, we have the previous `Data loader` section and the new `RNN` component. We don't envision that it will be useful to modify the `Data loader`. We have included some stubs to help give you a place to start for the RNN.

Additionally, we remind you that Part 1 furnishes a near-functional implementation of a similar neural model for the same task. If you complete Part 1 correctly, it will be wholly functional. Using it as a template for Part 2 is both prudent and suggested.

In [None]:
import torch.nn.utils.rnn as rnn_utils
def rnn_preprocessing(data, word2index, test=False):
  """rnn_preprocessing

  :param data:
  :type data:
  :param test:
  :type test:
  """
  # Do some preprocessing similar to convert_to_vector_representation
  # For the RNN, remember that instead of a single vector per training
  # example, you will have a sequence of vectors where each vector
  # represents some information about a specific token.
  if test:
    vectorized_data = []
    max_l = find_max_seq_len(train)
    for document in data:
      vector = torch.zeros(max_l,1) 
      j = 0
      for word in document:
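        index = word2index.get(word)
        if index is None:
          index = word2index.get(unk)
        # Index 0 is reserved for padding, so move the word that owns it
        # to the extra slot past the end of the vocabulary (11832 here)
        if index == 0:
          index = len(word2index)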
        #print(index)
        vector[j] = index
        j = j +1
      vectorized_data.append(vector)
  else:
    vectorized_data = []
    #ith document
    max_l = find_max_seq_len(train)
    #print(max_l)
    #el = find_each_seq_len(data)
    for document, y in data:
      vector = torch.zeros(max_l, 1)
      #print(vector.size())
      j = 0  #jth seq in doc
      for word in document:
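        index = word2index.get(word)
        if index is None:
          index = word2index.get(unk)
        # Index 0 is reserved for padding, so move the word that owns it
        # to the extra slot past the end of the vocabulary (11832 here)
        if index == 0:
          index = len(word2index)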
        #print(index)
        vector[j] = index
        #print(vector.size())
        j = j + 1
      #packed_sequence = rnn_utils.pack_sequence([torch.FloatTensor(i) for i in vector]) 
      #pad_seq = rnn_utils.pad_sequence(vector, batch_first=True, padding_value=0)
      vectorized_data.append((vector, y))
  return vectorized_data

In [None]:
#this func will find the max sequence length in our data
def find_max_seq_len(data):
  max_seq = 0
  for document in data:
    if len(document[0]) > max_seq:
      max_seq = len(document[0])
  return max_seq

find_max_seq_len(train)

#this func will find the length of each sequence and put everything to a list
def find_each_seq_len(data):
  len_seq = []
  for document in data:
    len_seq.append(len(document[0]))
  return len_seq
#find_each_seq_len(train)

In [None]:
train_vectorized = rnn_preprocessing(train, word2index)
unk = '<UNK>'
val_vectorized = rnn_preprocessing(val, word2index)
test_vectorized = rnn_preprocessing(test, word2index, True)

#load data in a chosen batchsize 
train_loader, val_loader = get_data_loaders(train_vectorized, val_vectorized,batch_size = 64)

In [None]:
#this func will find the unpadded length of each padded sequence and append everything to a list
def find_sent_len(data):
  lent = []
  max_len = data.size(1)  #padded length, 66 for this dataset
  for i in range(len(data)):
    ct = 0
    for j in range(max_len):
      if data[i][j] == 0:
        ct = ct + 1
    lent.append(max_len-ct)
  return lent
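The same lengths can be computed without Python loops. The sketch below assumes, as in the preprocessing above, that 0 is used only for padding and never as a real token index; the toy batch `padded` is made up for illustration.

```python
import torch

# Toy padded batch shaped [batch, max_len, 1], with 0 as the padding value
padded = torch.tensor([[[4], [9], [0], [0]],
                       [[7], [2], [5], [1]],
                       [[3], [0], [0], [0]]])

# Count the non-zero entries per row instead of looping over every position
lengths = (padded.squeeze(-1) != 0).sum(dim=1).tolist()
print(lengths)  # [2, 4, 1]
```

On a GPU batch this avoids one tensor comparison per token in Python, which can matter when the function is called for every batch.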

In [None]:
#embedding test dim
test_emb = nn.Embedding(len(vocab)+1,100,padding_idx = 0)
for item,label in train_loader:
  print(item.size())
  item = torch.tensor(item).to(torch.int64)
  item = torch.transpose(item,0,1).squeeze(-1)
  print(item.size())
  test_emb(item)

torch.Size([64, 66, 1])
torch.Size([66, 64])


In [None]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, emb, h_size, output_size):
        
        super().__init__()
        
        self.emb = nn.Embedding(input_size, emb, padding_idx = 0)
        
        self.rnn = nn.RNN(emb, h_size, num_layers=2, dropout=0.2, nonlinearity='relu')
        
        self.fc = nn.Linear(h_size*2, output_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.2)

        # Ensure parameters are initialized to small values, see PyTorch documentation for guidance
        self.softmax = nn.LogSoftmax(dim=1)
        self.loss = nn.NLLLoss()
        
    def forward(self, seq, text_len):

        #sequence = [seq_len, batch_size]
        #print(seq.size())
        seq = torch.transpose(seq,0,1).squeeze(-1)
        #print(seq.size())
        emb = self.emb(seq)

        #enable dropout after embedding layer
        emb = self.dropout(emb)

        #pack the emb; sequences are not sorted by descending length, so set enforce_sorted=False
        #To keep padding vectors at the end of each sequence from influencing the hidden state, we pack the data with its original lengths
        packed_emb = rnn_utils.pack_padded_sequence(emb, text_len, enforce_sorted=False)

        #print(emb.size())
        #embedding = [seq_len, batch_size, emb_dim]
      
        #print(emb.size())
        output, hidden = self.rnn(packed_emb)

        output, output_lengths = rnn_utils.pad_packed_sequence(output)
        #padding token for output are all zero tensors
        
        #print(output.size())
        #print(hidden.size())


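        #hidden = [num_layers, batch_size, h_size]; the RNN is unidirectional, so
        #hidden[-2,:,:] and hidden[-1,:,:] are the final states of the first and second layers
        #concat both layers' final hidden states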
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
        return self.fc(hidden)

    def load_model(self, save_path):
        self.load_state_dict(torch.load(save_path))
  
    def save_model(self, save_path):
        torch.save(self.state_dict(), save_path)

    def init_hidden(self):
        # self.h_size is never set in __init__, so read the size from the RNN layer
        hidden = torch.zeros(1, self.rnn.hidden_size)
        return hidden
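The layout of the `hidden` tensor returned by a stacked `nn.RNN` is easy to misread, so here is a small sketch with made-up sizes. For a unidirectional RNN, `hidden` has one slice per layer, and the last slice equals the top layer's output at the final time step (for unpadded input).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.RNN(input_size=3, hidden_size=4, num_layers=2)  # unidirectional
x = torch.randn(5, 2, 3)            # [seq_len, batch, input_size]
output, hidden = layer(x)

print(tuple(output.shape))          # (5, 2, 4): top-layer states for every step
print(tuple(hidden.shape))          # (2, 2, 4): final state of each layer
# hidden[-1] is the top layer's final state, i.e. output at the last step
top_matches = torch.allclose(output[-1], hidden[-1])
print(top_matches)
```

Forward and backward final states only exist when the layer is constructed with `bidirectional=True`.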

In [None]:

#set up RNN model
hidden_layer = 1024
out_size = 6
rnn = RNN(len(vocab)+1, 100, hidden_layer, out_size)
print(rnn)
rnn = rnn.cuda()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(rnn.parameters(), lr=0.0001)
#optimizer = optim.SGD(rnn.parameters(), lr=0.1, momentum=0.09)
#learning_rate = 0.005






RNN(
  (emb): Embedding(11833, 100, padding_idx=0)
  (rnn): RNN(100, 1024, num_layers=2, dropout=0.2)
  (fc): Linear(in_features=2048, out_features=6, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
  (softmax): LogSoftmax(dim=1)
  (loss): NLLLoss()
)


In [None]:
def count_acc(pre, label):
  correct = 0
  for i in range(len(label)):
    if pre[i] == label[i]:
      correct += 1
  return correct


def train_func(model, itr, optimizer, criterion):
    
    loss_per_epoch = 0

    #correct predictions
    corrects = 0
    model.train()
    #model.to(device)
    
    for batch, label in itr:
        batch = batch.to(torch.int64)  # batch is already a tensor; cast without re-wrapping it
        #print(batch[0])
        batch = batch.cuda()
        label = label.cuda()
        #hidden = rnn.init_hidden()
        #for i in range(batch[0][0].size()[0]):
          #print(i)
          #output, hidden = rnn(batch[0][0][i],hidden)
         
        optimizer.zero_grad()
        #print(batch.shape)
                
        output = model(batch, find_sent_len(batch))


        _,predictions = torch.max(output, 1)
        #print(predictions)
        #print(label)

        #print(output.shape)
        
        loss = criterion(output, label)
        #print(loss)
        corrects += count_acc(predictions, label)
       
        
        loss.backward()
        
        optimizer.step()
        
        loss_per_epoch += loss.item()
        
    #return the average loss and number of correct predictions
    return loss_per_epoch / len(itr), corrects
    
    
def evaluation_func(model, itr, criterion):
    
    loss_per_epoch = 0
    acc_per_epoch = 0
    corrects = 0
    model.eval()
    #make sure no gradients are calculated
    with torch.no_grad():
    
        for batch, label in itr:
            batch =  torch.tensor(batch).to(torch.int64)
            batch = batch.cuda()
            label = label.cuda()

            output = model(batch, find_sent_len(batch))

            _,predictions = torch.max(output, 1)
            
            loss = criterion(output, label)
            
            #acc = binary_accuracy(predictions, label)
            #if label == predictions:
              #corrects += 1
            corrects += count_acc(predictions, label)
            
        

            loss_per_epoch += loss.item()
           
    #return the average loss and num of corrects predictions
    return loss_per_epoch / len(itr),  corrects

In [None]:

import torch
torch.backends.cudnn.enabled = False

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

#use cross entropy loss as loss func
criterion = nn.CrossEntropyLoss()
criterion = criterion.cuda()

#modified batch_size
batch_size = 64

#training epoch
num_epochs = 20

#stop threshold
best_loss = float('inf')

for epoch in range(num_epochs):

    
    train_loss, train_acc = train_func(rnn, train_loader, optimizer, criterion)
    val_loss, val_acc = evaluation_func(rnn,  val_loader, criterion)
    
    #keep track of best loss
    if val_loss < best_loss:
        best_loss = val_loss
        torch.save(rnn.state_dict(), 'rnn_fixed.pth')
    #print the training performance for each epoch
    print(f'num of epoch: {epoch+1:02}')
    print(f'\tTrain Loss: {train_loss:.3f} || Train Acc: {(train_acc/(len(train_loader)*batch_size))*100:.2f}%')
    print(f'\t Val. Loss: {val_loss:.3f} ||  Val. Acc: {(val_acc/(len(val_loader)*batch_size))*100:.2f}%')



num of epoch: 01
	Train Loss: 0.719 || Train Acc: 75.04%
	 Val. Loss: 1.042 ||  Val. Acc: 58.20%
num of epoch: 02
	Train Loss: 0.585 || Train Acc: 79.78%
	 Val. Loss: 0.938 ||  Val. Acc: 67.66%
num of epoch: 03
	Train Loss: 0.501 || Train Acc: 82.64%
	 Val. Loss: 0.714 ||  Val. Acc: 73.28%
num of epoch: 04
	Train Loss: 0.442 || Train Acc: 84.84%
	 Val. Loss: 0.690 ||  Val. Acc: 75.94%
num of epoch: 05
	Train Loss: 0.460 || Train Acc: 83.99%
	 Val. Loss: 0.725 ||  Val. Acc: 72.81%
num of epoch: 06
	Train Loss: 0.404 || Train Acc: 86.07%
	 Val. Loss: 0.845 ||  Val. Acc: 68.44%
num of epoch: 07
	Train Loss: 0.379 || Train Acc: 87.27%
	 Val. Loss: 0.702 ||  Val. Acc: 76.80%
num of epoch: 08
	Train Loss: 0.312 || Train Acc: 89.60%
	 Val. Loss: 0.744 ||  Val. Acc: 74.45%
num of epoch: 09
	Train Loss: 0.291 || Train Acc: 90.20%
	 Val. Loss: 0.640 ||  Val. Acc: 78.20%
num of epoch: 10
	Train Loss: 0.273 || Train Acc: 90.69%
	 Val. Loss: 0.696 ||  Val. Acc: 77.03%
num of epoch: 11
	Train Loss: 

In [None]:
#test how torch.max works
output = torch.tensor([[-1.5323, -1.6752, -1.5012, -1.9708, -1.4565, -6.5302]])
torch.max(output, 1)


torch.return_types.max(values=tensor([-1.4565]), indices=tensor([4]))

## 2.2 Part 2 Report
For Part 2, your report should have a description of each major step of implementing the RNN accompanied by the associated code-snippet. Each step should have an explanation for why you decided to do something (when one could reasonably do the same step in a different way); your justification will not be based on empirical results in this section but should relate to something we said in class, something mentioned in any of the course texts, or some other source (i.e. literature in NLP or official PyTorch documentation). **Unjustified, vague, and/or under-substantiated explanations will not receive credit.**

Things to include:

1. _Representation_ \
Each $\vec{x}_i$ needs to be produced in some way and should correspond to word $i$ in the text. This is different from the text classification approaches we have studied previously (BoW for example) where the entire document is represented with a single vector. Where and how is this being done for the RNN?

2. _Initialization_ \
There will be weights that you update in training the RNN. Where and how are these initialized?

3. _Training_ \
You are given the entire training set of N examples. How do you make use of this training set? How does the model modify its weights in training (this likely entails somewhere where gradients are computed and somewhere else where these gradients are used to update the model)?

4. _Model_ \
This is the core model code, ie. where and how you apply the RNN to the $\vec{x}_i$

5. _Linear Classifier_ \
Given the outputs of the RNN, how do you consume these to actually compute $\vec{y}$?

6. _Stopping_ \
How does your training procedure terminate?

7. _Hyperparameters_ \
To run your model, you must fix some hyperparameters, such as $h$ (the hidden dimensionality of the $\vec{z}_i$ referenced above). Be sure to exhaustively describe these hyperparameters and why you set them as you did (this almost certainly will require some brief exploration: we suggest the course text by Yoav Goldberg as well as possibly the PyTorch official documentation). Be sure to accurately cite either source.



### 2.2.1 Representation


        #rnn preprocess func, return list of word index with pad
        def rnn_preprocessing(data, word2index, test=False):
          if test:
            vectorized_data = []
            max_l = find_max_seq_len(train)
            for document in data:
              vector = torch.zeros(max_l,1) 
              j = 0
              for word in document:
                index = word2index.get(word)
                if index is None:
                  index = word2index.get(unk)
                if index == 0:
                  index = 11832
                vector[j] = index
                j = j +1
              vectorized_data.append(vector)
          else:
            vectorized_data = []
            max_l = find_max_seq_len(train)
            for document, y in data:
              vector = torch.zeros(max_l, 1)
              #print(vector.size())
              j = 0  #jth seq in doc
              for word in document:
                index = word2index.get(word)
                if index is None:
                  index = word2index.get(unk)
                if index == 0:
                  index = 11832
                vector[j] = index
                j = j + 1
              vectorized_data.append((vector, y))
          return vectorized_data



        #this func will find the max sequence length in our data
        def find_max_seq_len(data):
          max_seq = 0
          for document in data:
            if len(document[0]) > max_seq:
              max_seq = len(document[0])
          return max_seq

        find_max_seq_len(train)

        #this func will find the length of each sequence and put everything to a list
        def find_each_seq_len(data):
          len_seq = []
          for document in data:
            len_seq.append(len(document[0]))
          return len_seq
        #find_each_seq_len(train)

        #unk must be defined before preprocessing, since rnn_preprocessing looks it up
        unk = '<UNK>'
        train_vectorized = rnn_preprocessing(train, word2index)
        val_vectorized = rnn_preprocessing(val, word2index)
        test_vectorized = rnn_preprocessing(test, word2index, True)

        #load data in a chosen batchsize 
        train_loader, val_loader = get_data_loaders(train_vectorized, val_vectorized,batch_size = 64)




In our preprocessing, we avoid one-hot vectors: each would have as many dimensions as the vocabulary has words (over 11,000 here), which would make training slow. Instead, the RNN preprocessing function replaces every word with its index in the word2index dictionary built earlier, and words that are missing from the dictionary are mapped to the index of the 'UNK' token. Each sequence is then padded with '0' up to a fixed length, the maximum sequence length in the training data [in our case, 66], giving a tensor of shape [sequence length, 1] per document. Because index '0' originally belonged to the word 'a', we remap that word to '11832' so that '0' is reserved for the PAD token; the vocab length therefore becomes len(vocab) + 1. When we feed a batch into the net, it first passes through a word embedding layer (nn.Embedding) with an embedding dimension of 100, so each word in the sequence is represented by a vector of size 100. Since the layer is created with padding_idx = 0, the embedding for '0' stays a zero vector and receives no gradient during training. The embeddings are trained together with the RNN so that they represent word features better. Finally, we put each data tensor and its label in a tuple and pass the tuples to our Dataloader with the specified batch_size [64]. Below is an example of our training dataset, with a batch_size of 64.

In [None]:
data_iter = iter(train_loader)
tensor1, tensor2 = next(data_iter)
#input size
print(tensor1.squeeze(-1).size()) # batch_size, seq_length

#input
print(tensor1.squeeze(-1))

#label size
print(tensor2.size()) # batch_size
#label
print(tensor2)

#transpose to produce data in [seq_length, batch_size] format
print(torch.transpose(tensor1, 0, 1).squeeze(-1).size())
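
The index-and-pad representation described in this section can be sketched in plain Python. This is a minimal illustration rather than the notebook's implementation: `encode_and_pad`, the toy vocabulary, and the choice to reserve index 0 for padding by shifting every vocabulary index up by one are all assumptions made for the example.

```python
def encode_and_pad(tokens, word2index, max_len, unk="<UNK>", pad_idx=0):
    """Map tokens to indices and right-pad with pad_idx up to max_len.

    Hypothetical helper: vocabulary indices are shifted by +1 so that
    pad_idx (0) never collides with a real word.
    """
    ids = [word2index.get(tok, word2index[unk]) + 1 for tok in tokens[:max_len]]
    length = len(ids)                       # true length, before padding
    ids += [pad_idx] * (max_len - length)   # pad on the right
    return ids, length


# toy vocabulary (assumed for illustration only)
word2index = {"a": 0, "happy": 1, "day": 2, "<UNK>": 3}

ids, length = encode_and_pad(["a", "happy", "birthday"], word2index, max_len=5)
print(ids, length)
```

Returning the true length alongside the padded indices is what later allows `pack_padded_sequence` to skip the padding positions.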

### 2.2.2 Initialization




            def init_hidden(self):
                #all-zero initial hidden state (torch.autograd.Variable is no longer needed)
                hidden = torch.zeros(1, self.h_size)
                return hidden




[Initialization documentation from nn.RNN library](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html)



[Initialization documentation from nn.Embedding library](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)


[Initialization documentation from nn.Linear library](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)

Weight initialization covers three parts: the embedding layer, the RNN layer, and the linear layer. All of them use PyTorch's default random initialization.
For the embedding layer, we use the default described in the nn.Embedding documentation: the weight matrix of shape (num_embeddings, embedding_dim) is initialized from $\mathcal{N}(0, 1)$, and the row at padding_idx is set to zeros. As a future improvement, we could load pre-trained Word2vec weights to increase accuracy.
For the RNN layer, we also use the default described in the nn.RNN documentation: all weights and biases are initialized from $\mathcal{U}(-\sqrt{k}, \sqrt{k})$, where $k = 1/\text{hidden\_size}$. We also define a function [init_hidden] that returns an all-zero initial hidden state; note that this is a state rather than a trainable weight, and that PyTorch already defaults to an all-zero tensor when no initial hidden state is passed.
For the linear layer, the weights and bias are initialized from the same form of uniform distribution, with $k = 1/\text{in\_features}$.
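
The default initializations cited above can be checked empirically. The snippet below is a small sketch with made-up layer sizes (the variable names and sizes are assumptions, not the notebook's configuration); it only inspects freshly constructed PyTorch layers.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# small sizes chosen only for illustration
vocab_size, emb_dim, h_size, out_size = 50, 8, 16, 6

emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
rnn = nn.RNN(emb_dim, h_size, num_layers=2)
fc = nn.Linear(h_size * 2, out_size)

eps = 1e-6  # tolerance for floating-point rounding

# nn.Embedding: the row at padding_idx starts as all zeros
pad_row_is_zero = bool(torch.all(emb.weight[0] == 0))

# nn.RNN: every weight and bias lies in [-sqrt(k), sqrt(k)], k = 1 / hidden_size
rnn_bound = math.sqrt(1.0 / h_size)
rnn_within_bound = all(p.abs().max().item() <= rnn_bound + eps for p in rnn.parameters())

# nn.Linear: same uniform form, but k = 1 / in_features
fc_bound = math.sqrt(1.0 / fc.in_features)
fc_within_bound = all(p.abs().max().item() <= fc_bound + eps for p in fc.parameters())

print(pad_row_is_zero, rnn_within_bound, fc_within_bound)
```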


### 2.2.3 Training




        def count_acc(pre, label):
          correct = 0
          for i in range(len(label)):
            if pre[i] == label[i]:
              correct += 1
          return correct


        def train_func(model, itr, optimizer, criterion):
            
            loss_per_epoch = 0

            #correct predictions
            corrects = 0
            model.train()
            #model.to(device)
            
            for batch, label in itr:
                batch =  torch.tensor(batch).to(torch.int64)
                batch = batch.cuda()
                label = label.cuda()
                
                optimizer.zero_grad()

                        
                output = model(batch, find_sent_len(batch))


                _,predictions = torch.max(output, 1)

                
                #Here is where the loss is computed
                loss = criterion(output, label)
                #print(loss)
                corrects += count_acc(predictions, label)
              
                #backpropagation computes the gradients, then the optimizer uses them to update the weights
                loss.backward()
                optimizer.step()
                
                loss_per_epoch += loss.item()
                
            #return the average loss and num of corrects predictions    
            return loss_per_epoch / len(itr), corrects
            
            
        def evaluation_func(model, itr, criterion):
            
            loss_per_epoch = 0
            acc_per_epoch = 0
            corrects = 0
            model.eval()
            #make sure no gradients are calculated
            with torch.no_grad():
            
                for batch, label in itr:
                    batch =  torch.tensor(batch).to(torch.int64)
                    batch = batch.cuda()
                    label = label.cuda()

                    output = model(batch, find_sent_len(batch))

                    _,predictions = torch.max(output, 1)
                    
                    loss = criterion(output, label)
                    
                    #acc = binary_accuracy(predictions, label)
                    #if label == predictions:
                      #corrects += 1
                    corrects += count_acc(predictions, label)
                    
                

                    loss_per_epoch += loss.item()
                  
            #return the average loss and num of corrects predictions
            return loss_per_epoch / len(itr),  corrects




        #below is the process when we call the training and val function
        import torch
        torch.backends.cudnn.enabled = False

        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        #use cross entropy loss as loss func
        criterion = nn.CrossEntropyLoss()
        criterion = criterion.cuda()

        #modified batch_size
        batch_size = 64

        #training epoch
        num_epochs = 20

        #stop threshold
        best_loss = float('inf')

        for epoch in range(num_epochs):

            
            train_loss, train_acc = train_func(rnn, train_loader, optimizer, criterion)
            val_loss, val_acc = evaluation_func(rnn,  val_loader, criterion)
            
            #keep track of best loss
            if val_loss < best_loss:
                best_loss = val_loss
                torch.save(rnn.state_dict(), 'rnn_fixed.pth')
            #print the training performance for each epoch
            print(f'num of epoch: {epoch+1:02}')
            print(f'\tTrain Loss: {train_loss:.3f} || Train Acc: {(train_acc/(len(train_loader)*batch_size))*100:.2f}%')
            print(f'\t Val. Loss: {val_loss:.3f} ||  Val. Acc: {(val_acc/(len(val_loader)*batch_size))*100:.2f}%')




We first build our data loaders with a specified batch_size; the loader can shuffle the samples each time we iterate over it during training. In the training stage, we feed the model batches of size 64. For each batch, we first zero the gradients. In the forward function, we transpose the batch to [seq_len, batch_size], pass it through the embedding layer, and apply dropout (0.2) to the embeddings. Since the data is already padded, we then pack the embeddings with pack_padded_sequence so that the RNN skips the pad positions and they contribute nothing to the gradients. We feed the packed data to the rnn layer, which returns the output and the hidden state, and we unpack the output with pad_packed_sequence. Because the rnn has 2 layers and is not bidirectional, hidden[-2,:,:] and hidden[-1,:,:] are the final hidden states of the first and second layer; we concatenate the two and feed the result to a fully connected linear layer. Our criterion, nn.CrossEntropyLoss(), then computes the loss between the outputs and the labels. Finally, loss.backward() performs backpropagation to compute the gradients, and optimizer.step() updates the parameters from those gradients and the pre-defined learning rate. We use the Adam optimizer for these updates.
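
The order of operations in a single training step can be isolated in a much smaller setting. The sketch below swaps the RNN for a stand-in `nn.Linear` model and uses random data (both are assumptions for illustration); it shows that gradients appear only after `loss.backward()` and that `optimizer.step()` is what changes the weights.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# stand-in model and data (assumptions for illustration only)
model = nn.Linear(4, 3)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

inputs = torch.randn(8, 4)           # batch of 8 examples
labels = torch.randint(0, 3, (8,))   # integer class labels

before = model.weight.detach().clone()

optimizer.zero_grad()                # clear gradients left from a previous step
output = model(inputs)               # forward pass: unnormalized scores
loss = criterion(output, labels)     # scalar loss; gradients are not computed here
loss.backward()                      # backpropagation fills .grad for each parameter
grad_norm = model.weight.grad.norm().item()
optimizer.step()                     # Adam uses .grad to update the weights

weights_changed = not torch.equal(before, model.weight.detach())
print(f"loss={loss.item():.3f} grad_norm={grad_norm:.3f} changed={weights_changed}")
```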

### 2.2.4 Model


        import torch.nn as nn

        class RNN(nn.Module):
            def __init__(self, input_size, emb, h_size, output_size):
                
                super().__init__()
                
                self.emb = nn.Embedding(input_size, emb, padding_idx = 0)
                
                self.rnn = nn.RNN(emb, h_size, num_layers=2, dropout=0.2, nonlinearity='relu')
                
                self.fc = nn.Linear(h_size*2, output_size)
                #store the hidden size so init_hidden can use it
                self.h_size = h_size
                self.relu = nn.ReLU()
                self.dropout = nn.Dropout(0.2)

                # Ensure parameters are initialized to small values, see PyTorch documentation for guidance
                self.softmax = nn.LogSoftmax(dim=1)
                self.loss = nn.NLLLoss()
                
            def forward(self, seq, text_len):

                #sequence = [seq_len, batch_size]
                #print(seq.size())
                seq = torch.transpose(seq,0,1).squeeze(-1)
                #print(seq.size())
                emb = self.emb(seq)

                #enable dropout after embedding layer
                emb = self.dropout(emb)

                #pack the emb; sequences are not in descending length order, so we set enforce_sorted = False
                #packing with the original lengths keeps the padding at the end of each sequence out of the RNN computation
                packed_emb = rnn_utils.pack_padded_sequence(emb, text_len, enforce_sorted=False)

                #print(emb.size())
                #embedding = [seq_len, batch_size, emb_dim]
              
                #print(emb.size())
                output, hidden = self.rnn(packed_emb)

                output, output_lengths = rnn_utils.pad_packed_sequence(output)
                #padding token for output are all zero tensors
                
                #print(output.size())
                #print(hidden.size())


                #the rnn is unidirectional with 2 layers: concat the final hidden states of layer 1 (hidden[-2,:,:]) and layer 2 (hidden[-1,:,:])
                hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
                return self.fc(hidden)

            def load_model(self, save_path):
                self.load_state_dict(torch.load(save_path))
          
            def save_model(self, save_path):
                torch.save(self.state_dict(), save_path)

            def init_hidden(self):
                #all-zero initial hidden state (torch.autograd.Variable is no longer needed)
                hidden = torch.zeros(1, self.h_size)
                return hidden




Our RNN model is built on top of the torch.nn library and is composed of three layers: an embedding layer, an RNN layer, and a fully connected linear layer. The embedding layer turns each word index into a dense embedding vector, expanding the batch into a tensor of shape [seq_len, batch_size, emb_dim]. We expect words with similar meanings to be mapped close to each other in the embedding space. After applying dropout to the embeddings, we feed them to the rnn. At each time step, the rnn takes the current input vector and the previous hidden state and computes the next hidden state. The output of the rnn contains the top layer's hidden state at every time step, while hidden holds the final hidden state of each layer. In the rnn layer, we use ReLU as the activation function and a dropout of 0.2 between the two layers. Finally, the linear layer takes the concatenated final hidden states and transforms them into the correct output dimension. The forward function is called whenever we feed data examples to the model.
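
To make the shapes concrete, here is a small sketch of a 2-layer unidirectional `nn.RNN` run on a packed batch. The sizes and sequence lengths are made up for illustration; the point is what `hidden[-2]` and `hidden[-1]` contain and how padding is handled.

```python
import torch
import torch.nn as nn
import torch.nn.utils.rnn as rnn_utils

torch.manual_seed(0)

seq_len, batch_size, emb_dim, h_size = 5, 3, 4, 6
rnn = nn.RNN(emb_dim, h_size, num_layers=2)  # unidirectional, two stacked layers

emb = torch.randn(seq_len, batch_size, emb_dim)  # [seq_len, batch_size, emb_dim]
lengths = torch.tensor([5, 2, 3])                # true lengths, kept on the CPU

packed = rnn_utils.pack_padded_sequence(emb, lengths, enforce_sorted=False)
packed_out, hidden = rnn(packed)
output, out_lengths = rnn_utils.pad_packed_sequence(packed_out)

print(hidden.shape)  # [num_layers, batch_size, h_size]
print(output.shape)  # [longest length in batch, batch_size, h_size]

# hidden[-1] is the top layer's state at each sequence's last real token
last_valid = torch.stack([output[n - 1, b] for b, n in enumerate(lengths.tolist())])
matches_top_layer = torch.allclose(hidden[-1], last_valid)

# positions past a sequence's true length come back as zeros
padding_is_zero = bool(torch.all(output[2:, 1] == 0))

# final states of layer 1 and layer 2, concatenated per example
features = torch.cat((hidden[-2], hidden[-1]), dim=1)  # [batch_size, 2 * h_size]
print(matches_top_layer, padding_is_zero, features.shape)
```

Concatenating the two layers' final states is one design option; using only `hidden[-1]` would halve the input size of the linear layer.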

### 2.2.5 Linear Classifier

        class RNN(nn.Module):
            def __init__(self, input_size, emb, h_size, output_size):
                
                super().__init__()
                ..........

                self.fc = nn.Linear(h_size*2, output_size)
                ..........
        
                
            def forward(self, seq, text_len):

                ..........
                return self.fc(hidden)



            ........
            for batch, label in itr:
                batch =  torch.tensor(batch).to(torch.int64)
                batch = batch.cuda()
                label = label.cuda()
                
                optimizer.zero_grad()

                        
                output = model(batch, find_sent_len(batch))


                _,predictions = torch.max(output, 1)
            ........


Our fully connected linear layer takes the final hidden states and transforms them into the correct output dimension, giving one score per label. Applying softmax to these scores gives a probability for each label, and we take the label with the highest score (torch.max) as our prediction. During training and validation, our criterion, CrossEntropyLoss(), already applies log-softmax internally, so we pass it the raw scores. Because softmax does not change which score is largest, torch.max on the raw scores also gives the prediction on the test set; softmax is only needed there if we want to report probabilities.
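
The relationship between raw scores, softmax, and the cross-entropy loss can be worked through in plain Python. The scores below are made-up numbers and the helper names are hypothetical; the sketch shows that taking the argmax before or after softmax picks the same label, and that the cross-entropy for one example is the negative log of the gold label's probability.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# raw scores for one example over six classes (made-up numbers)
logits = [0.3, -1.2, 2.5, 0.0, 1.1, -3.0]
probs = softmax(logits)

# softmax preserves the ordering of the scores, so the prediction is the same
pred_from_logits = argmax(logits)
pred_from_probs = argmax(probs)

# cross-entropy for gold label y is -log(softmax(logits)[y])
gold = 2
loss = -math.log(probs[gold])

print(pred_from_logits, pred_from_probs, round(loss, 4))
```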

### 2.2.6 Stopping




        #stop threshold
        best_loss = float('inf')

        for epoch in range(num_epochs):

            
            train_loss, train_acc = train_func(rnn, train_loader, optimizer, criterion)
            val_loss, val_acc = evaluation_func(rnn,  val_loader, criterion)
            .......
            #here is our stopping method
            #keep track of best loss
            if val_loss < best_loss:
                best_loss = val_loss
                torch.save(rnn.state_dict(), 'rnn_fixed.pth')
            ......




We manually set the number of epochs for training, and we adjusted this number based on the training and validation loss and accuracy [we track the average loss and the accuracy for each epoch]. We initialize the best validation loss to infinity, and after each epoch we save the model only if its validation loss is lower than the best so far. Thus the saved checkpoint is always the best model seen, even if later epochs overfit. Training stops once it finishes the number of epochs we set. We could instead stop early as soon as the validation loss exceeds that of previous epochs, but we kept our method because the validation loss sometimes fluctuates from epoch to epoch, and stopping at the first increase could miss a later, better model. For this reason, we prefer to train for a fixed number of epochs and keep track of the best loss.
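
The trade-off between best-checkpoint tracking and patience-based early stopping can be illustrated on the validation losses of the first ten epochs from the training log in this notebook. The function below is a hypothetical sketch, not part of our training code; `patience` is an assumed parameter meaning "stop after this many consecutive epochs without improvement".

```python
def best_checkpoint_epoch(val_losses, patience=None):
    """Return (epoch of best checkpoint, epoch at which training stops).

    With patience=None, training always runs through every epoch.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch          # this is where the model would be saved
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if patience is not None and epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)


# validation losses for epochs 1 to 10, copied from the training log
val_losses = [1.042, 0.938, 0.714, 0.690, 0.725, 0.845, 0.702, 0.744, 0.640, 0.696]

fixed_epochs = best_checkpoint_epoch(val_losses)
with_patience = best_checkpoint_epoch(val_losses, patience=3)
print(fixed_epochs, with_patience)
```

On these losses, a patience of 3 would stop before the lower loss at epoch 9 is reached, which is the fluctuation issue described above.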

### 2.2.7 Hyperparameters



        #set up RNN model
        hidden_layer = 1024

        out_size = 6

        #word embedding_dimension
        embedding_dimensions = 100
        
        #modified batch_size
        batch_size = 64
 
        #training epoch
        num_epochs = 20

        #learning_rate
        learning_rate = 0.0001

        rnn = RNN(len(vocab)+1, embedding_dimensions, hidden_layer, out_size)
        print(rnn)
        rnn = rnn.cuda()
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(rnn.parameters(), lr=learning_rate)
        #optimizer = optim.SGD(rnn.parameters(), lr=0.1, momentum=0.09)
   






The hyperparameters we have are the hidden dimension, batch size, learning rate, number of training epochs, embedding dimension, and optimizer. Because the effect of hyperparameters varies between datasets, we need to find good values through our own experiments. The hidden dimension has a significant impact on the model; we assume that a large hidden dimension gives a more complex model that is less likely to underfit but harder to train. Batch size is also a critical hyperparameter to tune: too large a batch size speeds up computation but can lead to poor generalization, while a small batch size leads to slow convergence and does not guarantee finding the global optimum. The learning rate is also essential. With a large learning rate, each update changes the weights so much that training may fail to converge; with a small learning rate, training is very slow and may not find the global optimum. For the number of training epochs, we assume the model keeps improving on the training set as the number of epochs increases, so we guard against overfitting by checking the loss and accuracy on the validation set. The embedding dimension in our embedding layer is also an important parameter, since it determines how much information each word representation can hold. We cannot set it too small because we have a large vocab, but larger embedding dimensions cost more training time. The optimizer also affects model performance: we tried SGD and Adam and found that Adam was the best fit in terms of accuracy and efficiency.
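
One way to see how the hidden and embedding dimensions drive model size is to count trainable parameters. The helper below is a hypothetical sketch; it assumes a unidirectional `nn.RNN` with biases and a linear layer fed with 2 * h_size inputs, matching the layer shapes printed for our model (vocabulary of 11833 entries, embedding dimension 100, 2 layers, 6 outputs).

```python
def rnn_param_count(vocab_size, emb_dim, h_size, num_layers, out_size):
    """Count trainable parameters for Embedding -> multi-layer RNN -> Linear.

    Assumes a unidirectional RNN with biases and a Linear layer that takes
    the concatenated final states of two layers (2 * h_size inputs).
    """
    embedding = vocab_size * emb_dim
    rnn = 0
    for layer in range(num_layers):
        in_dim = emb_dim if layer == 0 else h_size
        # input-to-hidden weights, hidden-to-hidden weights, and two bias vectors
        rnn += in_dim * h_size + h_size * h_size + 2 * h_size
    linear = 2 * h_size * out_size + out_size
    return embedding + rnn + linear


for h in (256, 512, 1024):
    print(h, rnn_param_count(11833, 100, h, 2, 6))
```

The hidden-to-hidden term grows quadratically in h_size, so doubling the hidden dimension roughly quadruples the cost of the recurrent layers.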

# Part 3: Analysis
From **Part 1** and **Part 2**, you will have two different models in hand for performing the same emotion detection task. In **Part 3**, you will conduct a comprehensive analysis of these models, focusing on two comparative settings.

## Part 3 Note
You will be required to submit the code used in finding these results on CMSX. This code should be legible and we will consult it if we find issues in the results. It is worth noting that in **Part 1** and **Part 2**, we primarily are considering the correctness of the code-snippets in the report. If your model is flawed in a way that isn’t exposed by those snippets, this will likely surface in your results for **Part 3**. We will deduct points for correctness in this section to reflect this and we will try to localize where the error is (or think it is, if it is opaque from your code). That said, we will be lenient about absolute performance (within reason) in this section.

## 3.1: Across-Model Comparison
In this section, you will report results detailing the comparison of the two models. Specifically, we will consider the issue of _fair comparison_<sup>5</sup>, which is a fundamental notion in NLP and ML research and practice. In particular, given model $A$, it is likely the case we can make a model $B$ that is computationally more complex and, hence, more costly and achieves superior performance. However, this makes for an unfair comparison. For our purposes, we want to study how the FFNN and RNN compare when we try to control for hyperparameters and other configurable values being of similar computational cost<sup>6</sup>. That said, it is impossible to have identical configurations as these are different models, i.e. the RNN simply has hyperparameters for which there are no analogues in the FFNN.


In the report you will need to begin by describing 3 pairs of configurations, with each pair being comprised of a FFNN configuration and a RNN configuration that constitute a _fair comparison_. You will need to argue for why the two parts of each pair are a fair comparison. Across the pairs, you should try different types of configurations (e.g. trying to resolve questions of the form: _Does the FFNN perform better or worse when the hidden dimensionality is small as opposed to when it is large?_) and justify what you are trying to study by having the results across the pairs.


Next, you will report the quantitative accuracy of the 6 resulting models. You will
analyze these results and then move on to a more descriptive analysis.

The descriptive analysis can take one of two forms<sup>7</sup>:

1. _Nuanced quantitative analysis_ \
If you choose this option, you will need to further break down the quantitative statistics you reported initially. We provide some initial strategies to prime you for what you should think about in doing this: one possible starting point is to consider: if model $X$ achieves greater accuracy than model $Y$, to what extent is $X$ getting everything correct that $Y$ gets correct? Alternatively, how is model performance affected if you measure performance on a specific strata/subset of the reviews?

2. _Nuanced qualitative analysis_ \
If you choose this option, you will need to select individual examples and try to explain or reason about why one model may be getting them right whereas the other isn’t. Are there any examples that all 6 models get right or wrong and, if so, can you hypothesize a reason why this occurs?
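
For the quantitative option, the question of whether model $X$ gets right everything that model $Y$ gets right can be answered by comparing the sets of correctly classified examples. The sketch below uses made-up labels and predictions, and `compare_correct` is a hypothetical helper name.

```python
def compare_correct(gold, preds_x, preds_y):
    """Split example indices by which of two models classified them correctly."""
    correct_x = {i for i, (g, p) in enumerate(zip(gold, preds_x)) if g == p}
    correct_y = {i for i, (g, p) in enumerate(zip(gold, preds_y)) if g == p}
    everything = set(range(len(gold)))
    return {
        "both": sorted(correct_x & correct_y),
        "only_x": sorted(correct_x - correct_y),
        "only_y": sorted(correct_y - correct_x),
        "neither": sorted(everything - correct_x - correct_y),
    }


# made-up labels and predictions for illustration
gold    = [0, 1, 2, 3, 4, 1]
preds_x = [0, 1, 2, 0, 4, 2]
preds_y = [0, 1, 0, 3, 0, 2]

overlap = compare_correct(gold, preds_x, preds_y)
print(overlap)
```

The indices in `only_x`, `only_y`, and `neither` are also natural starting points for picking individual examples in the qualitative option.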

In [None]:
#@markdown ⠀
display(HTML('''<hr><p style="font-family:verdana; font-size:90%;">
5. This term takes on different meanings in different settings. Here we simply mean that we are trying to
compare different models while controlling for similar “complexity”/computational cost. <br></br>

6. We have not taught you how to do this rigorously and the theory for doing this is still underdeveloped. We only expect a reasonable attempt. <br></br>

7. This is the minimal requirement, if you provide other, more elaborate, analyses, we certainly welcome this.
</p>'''))

### 3.1.1 Configuration 1
Modify the code below for this configuration.

In [None]:
ffnn_config_1 = FFNN()
rnn_config_1 = RNN()

### 3.1.1 Report
Describe configurations, report the results, and then perform a nuanced analysis

### 3.1.2 Configuration 2
Modify the code below for this configuration.

In [None]:
ffnn_config_2 = FFNN()
rnn_config_2 = RNN()

### 3.1.2 Report
Describe configurations, report the results, and then perform a nuanced analysis

### 3.1.3 Configuration 3
Modify the code below for this configuration.

In [None]:
ffnn_config_3 = FFNN()
rnn_config_3 = RNN()

### 3.1.3 Report
Describe configurations, report the results, and then perform a nuanced analysis

## Part 3.2: Within-model comparison
To complement **Part 3.1: Across-Model Comparison**, in **Part 3.2: Within-Model Comparison**, you will need to study what happens when you change parameters within a model. To limit your workload, you need only do this for the RNN; and you may use at most one RNN model from the prior section.

In the prior section, we discussed _fair comparison_. Another aspect of rigorous experimentation in NLP (and other domains) is the _ablation study_. In this, we _ablate_ or remove aspects of a more complex model, making it less complex, to evaluate whether each aspect was necessary. To be concrete, for this part, you should train 4 variants of the RNN model and describe them as we do below:

1. Baseline model
2. Baseline model made more complex by modification $A$ (e.g. changing the hidden dimensionality from $h$ to $2h$).
3. Baseline model made more complex by modification $B$ (where $B$ is an entirely distinct/different update from $A$).
4. Baseline model with both modifications $A$ and $B$ applied.

Under the framing of an ablation study, you would describe this as beginning with model 4 and then ablating (i.e. removing) each of the two modifications, in turn; and then removing both to see if they were genuinely necessary for the performance you observe.
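
The four variants can be generated systematically from a baseline and two independent modifications. The settings below are hypothetical placeholders (the specific values and the choice of modifications are assumptions); the sketch only shows the 2 x 2 structure of the study.

```python
from itertools import product

# hypothetical baseline settings and modifications
baseline = {"h_size": 512, "num_layers": 1}
mod_a = {"h_size": 1024}     # modification A: h -> 2h
mod_b = {"num_layers": 2}    # modification B: an unrelated change

variants = {}
for use_a, use_b in product((False, True), repeat=2):
    config = dict(baseline)          # copy so the baseline itself is untouched
    if use_a:
        config.update(mod_a)
    if use_b:
        config.update(mod_b)
    name = "baseline" + ("+A" if use_a else "") + ("+B" if use_b else "")
    variants[name] = config

for name, config in variants.items():
    print(name, config)
```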

Once you describe each of the four models, report the quantitative accuracy as in the previous section. Conclude by performing the **opposite** nuanced analysis from the one you did in the previous section (i.e. if in **Part 3.1: Across-Model Comparison** you did _Nuanced quantitative analysis_, for **Part 3.2: Within-Model Comparison** perform a _Nuanced qualitative analysis_ and vice versa).

### 3.2.1 Configuration 1
Modify the code below for this configuration.

In [None]:
baseline_rnn = RNN()

### 3.2.1 Report
Describe variants in the ablation style described, report the results, and then perform a nuanced analysis of the opposite type as before.

### 3.2.2 Configuration 2
Modify the code below for this configuration.

In [None]:
mod_a_rnn = RNN()

### 3.2.2 Report
Describe variants in the ablation style described, report the results, and then perform a nuanced analysis of the opposite type as before.

### 3.2.3 Configuration 3
Modify the code below for this configuration.

In [None]:
mod_b_rnn = RNN()

### 3.2.3 Report
Describe variants in the ablation style described, report the results, and then perform a nuanced analysis of the opposite type as before.

### 3.2.4 Configuration 4
Modify the code below for this configuration.

In [None]:
both_mods_rnn = RNN()  # name matches the "both_mods_rnn" option in the live demo

### 3.2.4 Report
Describe variants in the ablation style described, report the results, and then perform a nuanced analysis of the opposite type as before.

# Part 4: Questions
In **Part 4**, you will need to answer the three questions below. We expect answers to be to-the-point; answers that are vague, meandering, or imprecise **will receive fewer points** than a precise but partially correct answer.

## 4.1 Q1
Earlier in the course, we studied models that make use of _Markov_ assumptions. Recurrent neural networks do not make any such assumption. That said, RNNs are known to struggle with long-distance dependencies. What is a fundamental reason for why this is the case?

## 4.2 Q2
In applying RNNs to tasks in NLP, we have discovered that (at least for tasks in English) feeding a sentence into an RNN backwards (i.e. inputting the sequence of vectors corresponding to ($course$, $great$, $a$, $is$, $NLP$) instead of ($NLP$, $is$, $a$, $great$, $course$)) tends to improve performance. Why might this be the case?

## 4.3 Q3
In using RNNs and word embeddings for NLP tasks, we are no longer required to engineer specific features that are useful for the task; the model discovers them automatically. Stated differently, it seems that neural models tend to discover better features than human researchers can directly specify. This comes at the cost of systems having to consume tremendous amounts of data to learn these kinds of patterns from the data. Beyond concerns of dataset size (and the computational resources required to process and train using this data as well as the further environmental harm that results from this process), why might we disfavor RNN models?

# Part 5: Miscellaneous
List the libraries you used and the sources you referenced and cited (labelled with the section in which you referred to them). Include a description of how your group split up the work. Include brief feedback on this assignment.

Part 1:
```
import json
import math
import os
from pathlib import Path
import random
import time
from tqdm.notebook import tqdm, trange
from typing import Dict, List, Set, Tuple

import numpy as np
import torch
import torch.nn as nn
from torch.nn import init
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler
from collections import Counter
```
Part 2:

```
import torch.nn.utils.rnn as rnn_utils
import torch.nn as nn
```
Part 3:
```
import matplotlib.pyplot as plt
import nltk
import torch.nn as nn
import pandas as pd
import seaborn as sn
import torch
import numpy as np
```
Part 4: 
```
import torch.nn as nn
```



**Each section must be clearly labelled, complete, and the corresponding pages should be correctly assigned to the corresponding Gradescope rubric item.** If you follow these steps for each of the 4 components requested, you are guaranteed full credit for this section. Otherwise, you will receive no credit for this section.

# Part 6: Kaggle Submission

In [None]:
# Create Kaggle submission function
kaggle_model = None
rnn_document_preprocessor = lambda x: rnn_preprocessor(x, True) # This is for your RNN
file_name = "submission.csv"
ffnn_document_preprocessor = lambda x: convert_to_vector_representation(x, word2index, True)

In [None]:
import numpy as np
loaded_model = RNN(len(vocab)+1, 100, 1024, 6)

loaded_model.load_model("rnn_fixed.pth")

def generate_submission(file_name):
    # Switch to inference mode once, before iterating over the test set.
    loaded_model.eval()
    with Path(file_name).open("w") as fp:
        fp.write("Id,Predicted\n")
        ct = 0  # index of the current batch
        for input_vector in test_loader:
            # as_tensor avoids re-wrapping a batch that is already a tensor
            input_vector = torch.as_tensor(input_vector).to(torch.int64)
            with torch.no_grad():
                output = loaded_model(input_vector.cpu(), find_sent_len(input_vector))
                _, pred = torch.max(output, 1)
                pred = pred.numpy()
                for i in range(len(pred)):
                    # Row id = position in batch + 64 * batch index (64 must match batch_size)
                    fp.write(f"{(i + 64*ct)},{int(pred[i])}\n")
            ct = ct + 1

In [None]:
test_loader = DataLoader(test_vectorized, shuffle=False, batch_size=64)
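The ID arithmetic in `generate_submission` relies on the literal `64` matching `batch_size=64` above, so changing the batch size would silently produce wrong IDs. A minimal sketch of a running counter that does not depend on the batch size follows. The helper `assign_ids` and the example predictions are hypothetical and use plain lists in place of model output.

```python
def assign_ids(batches):
    """Yield (row_id, prediction) pairs with ids that run across batches."""
    next_id = 0
    for batch in batches:
        for pred in batch:
            yield next_id, int(pred)
            next_id += 1


# Hypothetical predictions in uneven batches (the last batch is short).
example_batches = [[2, 0, 1], [4, 4, 3], [1, 0]]
rows = list(assign_ids(example_batches))

for row_id, pred in rows:
    print(f"{row_id},{pred}")
```

The same counter could replace the `i + 64*ct` expression inside the write loop, at the cost of one extra variable.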

In [None]:
def decode_seq(batch_data):
  # Map a batch of padded index sequences back to lists of tokens.
  batch_data = batch_data.squeeze(-1)
  batch_sent = []
  for seq in batch_data:
    sent = []
    for j in range(66):  # fixed sequence length used in this notebook
      index = int(seq[j])
      if index == 11832:
        word = 'a'
      elif index == 0:
        word = '<pad>'
      else:
        # Every remaining index is looked up in the vocabulary.
        word = index2word[index]
      sent.append(word)
    batch_sent.append(sent)
  return batch_sent



In [None]:
for item,label in iter(val_loader):
  print(decode_seq(item))
  print(label)

[['i', 'said', 'before', 'i', 'feel', 'like', 'a', 'hypocrite', '<UNK>', 'for', 'diabetes', 'support', 'and', 'awareness', 'without', 'supporting', 'my', 'own', 'situation', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'], ['i', 'feel', '<UNK>', 'and', 'i', 'cant', 'be', 'bothered', 'to', 'fight', 'it', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad

In [None]:
generate_submission('submission2.csv')



# Live running demo

In [None]:
#@title Emotion Detection
#@markdown Enter a sentence to see the emotion
input_string = "I am so joyful!" #@param {type:"string"}
model_type = "baseline_ffnn" #@param ["baseline_ffnn", "baseline_rnn", "mod_a_rnn", "mod_b_rnn", "both_mods_rnn", "ffnn_config_1", "rnn_config_1", "ffnn_config_2", "rnn_config_2", "ffnn_config_3", "rnn_config_3"]
from IPython.display import HTML

output = ""

# BAD THING TO DO BELOW!!
model_used = globals()[model_type]

with torch.no_grad():
    if "ffnn" in model_type:
        vec_in = ffnn_document_preprocessor([[input_string]])[0]
        model_output = model_used(torch.Tensor(vec_in).unsqueeze(0)).cpu().squeeze(0)
    else:
        # RUN MODEL
        vec_in = rnn_document_preprocessor([[input_string]])[0]
        model_output = model_used(torch.Tensor(vec_in).unsqueeze(0)).cpu().squeeze(0)
    #print(torch.cat([torch.Tensor(z).unsqueeze(0) for z in model_inputs]).unsqueeze(0).shape)
    #model_output = model_used(torch.cat([torch.Tensor(z).unsqueeze(0) for z in model_inputs]).unsqueeze(0))
    #print(model_output.shape)
predicted = torch.argmax(model_output)
# MAP BACK TO EMOTION
# print(int(predicted))
emotion = idx_to_emotion[int(predicted)]

# Generate nice display
output += '<p style="font-family:verdana; font-size:110%;">'
output += " Input sequence: "+input_string+"</p>"
output += '<p style="font-family:verdana; font-size:110%;">'
output += f" Emotion detected: {emotion}</p><hr>"
output = "<h3>Results:</h3>" + output

display(HTML(output))

KeyError: ignored
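A `KeyError` like the one above can arise when `globals()[model_type]` is asked for a name that was never defined in the notebook, which is one reason the cell flags that lookup as a bad idea. A minimal sketch of an explicit registry with a readable error message follows. `MODEL_REGISTRY` and `resolve_model` are hypothetical names, and the `object()` placeholders stand in for trained models.

```python
def resolve_model(model_type, registry):
    """Look up a model by name, failing with a readable message."""
    try:
        return registry[model_type]
    except KeyError:
        available = ", ".join(sorted(registry)) or "<none>"
        raise ValueError(
            f"Unknown model_type {model_type!r}; available: {available}"
        ) from None


# Placeholder objects stand in for trained models in this sketch.
MODEL_REGISTRY = {
    "baseline_ffnn": object(),
    "baseline_rnn": object(),
}

model_used = resolve_model("baseline_rnn", MODEL_REGISTRY)
```

Registering only the models that have actually been trained makes a missing entry fail with a list of valid choices rather than a bare `KeyError`.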