# Assignment 6: Attention (please!)

---

## Task 2) Sentiment Analysis

In this task, we'll use the Kaggle Rotten Tomatoes dataset: [Source and Download instructions](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data).
The dataset consists of tab-separated files containing phrases from the Rotten Tomatoes dataset.
The train/test split has been preserved for the purposes of benchmarking, but the sentences have been shuffled from their original order.
Each sentence has been parsed into many phrases (chunks) using the Stanford parser.
Each phrase has a `PhraseId`, each sentence a `SentenceId`.
Phrases that are repeated (such as short/common words) are only included once in the data.

### Data

Rotten Tomatoes Dataset: `train.tsv` contains the phrases and their associated sentiment labels.
We have additionally provided a `SentenceId` so that you can track which phrases belong to a single sentence.
`test.tsv` contains just phrases; use your model to assign a sentiment label to each phrase.

The sentiment labels are:

* 0 - negative
* 1 - somewhat negative
* 2 - neutral
* 3 - somewhat positive
* 4 - positive
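
The label scheme and the phrase/sentence relationship can be inspected with a few lines of pandas. The sketch below uses a tiny hand-written frame in place of `train.tsv`; the rows are invented for illustration, and the column names (`PhraseId`, `SentenceId`, `Phrase`, `Sentiment`) are assumed to match the Kaggle file.

```python
import pandas as pd

# Toy stand-in for train.tsv; the rows are invented for illustration only.
toy = pd.DataFrame({
    "PhraseId": [1, 2, 3, 4, 5],
    "SentenceId": [1, 1, 1, 2, 2],
    "Phrase": ["a good movie", "good movie", "good", "a dull plot", "dull"],
    "Sentiment": [3, 3, 3, 1, 1],
})

IDX2SENTIMENT = {0: "negative", 1: "somewhat negative", 2: "neutral",
                 3: "somewhat positive", 4: "positive"}

# Count how often each sentiment label occurs, reported by readable name.
label_counts = toy["Sentiment"].map(IDX2SENTIMENT).value_counts().to_dict()

# Count how many phrases were derived from each sentence.
phrases_per_sentence = toy.groupby("SentenceId")["PhraseId"].count().to_dict()

print(label_counts)
print(phrases_per_sentence)
```

On the real training file, the same two calls give a quick picture of class imbalance and of how many phrases each sentence contributes.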

### GloVe Word Embeddings

Use GloVe word embeddings for your `nn.Embedding` layer; a number of pretrained models for English are available in the `torchtext` module.
You are free to use any kind of attention and architecture you like.
Just remember that the basic form of attention-based networks is always an encoder/decoder architecture.
Use `torchtext.vocab.GloVe` to get started quickly with the word embeddings.
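
As a rough sketch of how pretrained vectors can back an `nn.Embedding` layer, the example below uses a small random matrix as a stand-in for `glove.vectors` so that nothing has to be downloaded. The sizes and the padding index are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

# Stand-in for `glove.vectors`: 5 "words" with 4-dimensional vectors.
# With real GloVe (dim=50) this tensor would have shape (vocab_size, 50).
torch.manual_seed(0)
pretrained = torch.randn(5, 4)
pad_idx = 4                    # hypothetical index of the padding token
pretrained[pad_idx] = 0.0      # keep the padding vector at zero

# freeze=True keeps the pretrained vectors fixed during training.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=True, padding_idx=pad_idx)

token_ids = torch.tensor([[0, 2], [1, pad_idx]])   # (seq_len=2, batch=2)
vectors = embedding(token_ids)                     # (seq_len, batch, emb_dim)
print(vectors.shape)
```

Setting `freeze=False` instead would let the embeddings be fine-tuned along with the rest of the model, which is a design choice worth comparing.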


*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [None]:
# Dependencies
import os
import re
import tqdm
import string
import numpy as np
import pandas as pd
import sklearn.metrics as sklearn_metrics
from sklearn.model_selection import StratifiedKFold

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

### Prepare the Data

1.1 As always: conduct some data preprocessing.

1.2 Download and prepare the GloVe word embeddings. You'll need them for the modeling part, e.g. for `nn.Embedding`.

1.3 Create a PyTorch Dataset class which handles your tokenized data with respect to input and (class) labels.

In [None]:
def load_sentiment_dataset(filepath):
    """Loads all phrase instances and returns them as a dataframe."""
    ### YOUR CODE HERE
    
    return pd.read_csv(filepath, header=0, sep='\t')
    
    ### END YOUR CODE

In [None]:
def preprocess(dataframe):
    """Preprocesses and tokenizes the phrases in the given dataframe for further use."""
    ### YOUR CODE HERE
    
    def _preprocess_fn(text):
        # Replace punctuation and digits with spaces, collapse repeated spaces, lowercase
        remove_pun = str.maketrans(string.punctuation, ' '*len(string.punctuation))
        remove_digits = str.maketrans(string.digits, ' '*len(string.digits))
        text = text.translate(remove_digits)
        text = text.translate(remove_pun)
        text = re.sub(' {2,}', ' ', text)
        return text.lower()
    
    dataframe = dataframe.copy()
    
    # Remove punctuation and digits, then lowercase the phrases
    dataframe["Phrase"] = dataframe["Phrase"].apply(_preprocess_fn)

    # Filter out phrases that are empty or a single character after cleaning
    dataframe = dataframe[dataframe["Phrase"].str.len() > 1]

    # Reset index of dataframe
    dataframe = dataframe.reset_index(drop=True)

    # Simple whitespace tokenization of phrases
    dataframe["tokenized"] = [phrase.split() for phrase in dataframe["Phrase"].values]

    return dataframe

    ### END YOUR CODE
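
To make the effect of the cleaning step concrete, here is a small standalone sketch that mirrors the helper inside `preprocess` and applies it to one made-up phrase. The function name `clean` and the sample string are illustrative only.

```python
import re
import string

def clean(text):
    """Replace punctuation and digits with spaces, collapse runs of spaces, lowercase."""
    to_replace = string.punctuation + string.digits
    table = str.maketrans(to_replace, " " * len(to_replace))
    text = text.translate(table)
    text = re.sub(" {2,}", " ", text)
    return text.lower()

sample = "A series of escapades, 2 hours long!"
cleaned = clean(sample)
tokens = cleaned.split()

print(repr(cleaned))   # a trailing space can remain after cleaning
print(tokens)          # str.split() ignores leading and trailing whitespace
```

Because a single space can survive the cleaning, phrases that consisted only of punctuation or digits end up with length 1, which is what the length filter in `preprocess` removes.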

In [None]:
# Load and preprocess dataset
train_dataframe = load_sentiment_dataset("data/rotten_tomatoes_train.tsv")
train_dataframe = preprocess(train_dataframe)

# Map for formatting labels
IDX2SENTIMENT = {0: "negative", 1: "somewhat negative", 2: "neutral",
                 3: "somewhat positive", 4: "positive"}

# Test labels are not available; scoring test predictions requires a Kaggle submission
# test_dataframe = load_sentiment_dataset("data/rotten_tomatoes_test.tsv")
# test_dataframe = preprocess(test_dataframe)

print(f"Num train dataset: {len(train_dataframe)}")
# print(f"Num test dataset: {len(test_dataframe)}")

In [None]:
### Download the pre-trained (English) GloVe embeddings
from torchtext.vocab import GloVe

# Prepare glove embeddings
UNK_TOKEN = "<unk>"
PAD_TOKEN = "<pad>"

def append_special(glove, special, vec=None):
    glove.itos.append(special)
    glove.stoi[special] = glove.itos.index(special)
    if vec is None:
        vec = torch.zeros(1, glove.vectors.size(1))
    glove.vectors = torch.cat((glove.vectors, vec))
    return glove

glove = GloVe(name="6B", dim=50)

# We need to add some special tokens
glove = append_special(glove, UNK_TOKEN)
glove = append_special(glove, PAD_TOKEN)
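
The following sketch illustrates what appending special tokens does, using a toy vocabulary object in place of the downloaded GloVe vectors. The `toy` object and its contents are made up, and this variant assigns the new index as `len(itos) - 1`, which should be equivalent for a token that is not yet in the vocabulary.

```python
from types import SimpleNamespace
import torch

def append_special(vocab, special, vec=None):
    """Append `special` as the last entry, with a zero vector unless `vec` is given."""
    vocab.itos.append(special)
    vocab.stoi[special] = len(vocab.itos) - 1
    if vec is None:
        vec = torch.zeros(1, vocab.vectors.size(1))
    vocab.vectors = torch.cat((vocab.vectors, vec))
    return vocab

# Hypothetical two-word vocabulary standing in for the GloVe object.
toy = SimpleNamespace(itos=["good", "bad"],
                      stoi={"good": 0, "bad": 1},
                      vectors=torch.ones(2, 3))
toy = append_special(toy, "<unk>")
toy = append_special(toy, "<pad>")

# Out-of-vocabulary words fall back to the id of "<unk>".
ids = [toy.stoi.get(w, toy.stoi["<unk>"]) for w in ["good", "zzz", "bad"]]
print(ids, toy.vectors.shape)
```

The special tokens occupy the last rows of the vector matrix, so their ids stay aligned with the rows an `nn.Embedding` layer would look up.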

In [None]:
class RottenTomatoesDataset(Dataset):
    def __init__(self, dataset, labels, glove, unk="<unk>"):
        self.data, self.labels = [], []
        unk_idx = glove.stoi[unk]
        for tokens, label in zip(dataset, labels):
            # Create inputs; map tokens to ids, falling back to the unknown token
            ids = [glove.stoi.get(w, unk_idx) for w in tokens]
            self.data.append(torch.tensor(ids, dtype=torch.long))

            # Create labels; already an integer
            self.labels.append(label)


    def __len__(self):
        return len(self.data)


    def __getitem__(self, idx):
        # Returns one input and label sample
        return self.data[idx], self.labels[idx]

### Train and Evaluate

2.1 Implement and reuse your RNN-based classification models for the sentiment classification task.

2.2 Train and evaluate your models by performing a train-test-split on the `train.tsv` file.

2.3 Check and compare your classification results with some publicly available baselines (there are plenty on the internet).

2.4 Visualize the attention weights for the words and pick some nice samples for each sentiment category!

In [None]:
### TODO: 2.1 Implement RNN classifier (nn.Module)
### Notice: Think about padding for batch sizes > 1
### Notice: 'torch.nn.utils.rnn' provides functionality
### Notice: Here you can integrate the attention mechanism

### YOUR CODE HERE

from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence, pad_sequence

class GRU_Classifier(nn.Module):
    def __init__(self, glove, hidden_dim, num_classes, with_attention=False):
        super(GRU_Classifier, self).__init__()
        self.with_attention = with_attention

        # TODO: add glove embeddings
        self.embedding = None

        self.rnn = nn.GRU(
            input_size=glove.dim,
            hidden_size=hidden_dim,
            bidirectional=False,
            num_layers=1
        )

        self.fc = nn.Linear(hidden_dim, num_classes)

    
    def forward(self, X, lengths, hidden=None):
        embeddings = self.embedding(X)

        # Packed sequence helps avoid unnecessary computation on padded positions
        padded_seq = pack_padded_sequence(embeddings, lengths)

        outputs, hidden_states = self.rnn(padded_seq, hidden)

        # If the RNN returns a tuple (h_n, c_n), e.g. for an LSTM, select h_n
        if isinstance(hidden_states, tuple):
            hidden = hidden_states[0]
        else:
            hidden = hidden_states

        clf_input, weights = hidden, None

        # Apply classifier with hidden states
        logits = self.fc(clf_input.squeeze(0))

        return logits, hidden_states, weights


class SequencePadder():
    def __init__(self, symbol) -> None:
        self.symbol = symbol

    def __call__(self, batch):
        sorted_batch = sorted(batch, key=lambda x: x[0].size(0), reverse=True)
        sequences = [x[0] for x in sorted_batch]
        labels = [x[1] for x in sorted_batch]
        padded = pad_sequence(sequences, padding_value=self.symbol)
        lengths = torch.LongTensor([len(x) for x in sequences])
        return padded, torch.LongTensor(labels), lengths


### END YOUR CODE
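
One possible way to integrate attention is to pool the per-step GRU outputs with learned weights instead of using only the final hidden state. The sketch below is a minimal additive-style attention pooling layer under that assumption; the class name `AttentionPooling`, the toy tensor sizes, and the random inputs are illustrative and not part of the assignment template.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class AttentionPooling(nn.Module):
    """Additive-style attention that pools RNN outputs into one vector per sequence."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, outputs, lengths):
        # outputs: (seq_len, batch, hidden_dim); lengths: (batch,)
        scores = self.score(torch.tanh(self.proj(outputs))).squeeze(-1)  # (seq_len, batch)
        # Mask padded positions so they receive zero attention weight.
        positions = torch.arange(outputs.size(0), device=outputs.device).unsqueeze(1)
        mask = positions < lengths.to(outputs.device).unsqueeze(0)       # (seq_len, batch)
        scores = scores.masked_fill(~mask, float("-inf"))
        weights = F.softmax(scores, dim=0)                               # (seq_len, batch)
        context = (weights.unsqueeze(-1) * outputs).sum(dim=0)           # (batch, hidden_dim)
        return context, weights

torch.manual_seed(0)
embeddings = torch.randn(4, 2, 8)    # (seq_len, batch, emb_dim); toy values
lengths = torch.tensor([4, 2])       # true lengths, sorted in descending order
gru = nn.GRU(input_size=8, hidden_size=16)

packed = pack_padded_sequence(embeddings, lengths)
packed_outputs, _ = gru(packed)
outputs, _ = pad_packed_sequence(packed_outputs)   # (seq_len, batch, hidden_dim)

attention = AttentionPooling(hidden_dim=16)
context, weights = attention(outputs, lengths)
print(context.shape, weights[:, 1])
```

In a classifier, `context` could replace the final hidden state as input to the linear layer, and `weights` can be returned for later visualization.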

In [None]:
### TODO: 2.2 Implement the train functionality

### YOUR CODE HERE

def train(model, dataloader, criterion, optimizer, device):
    raise NotImplementedError()

### END YOUR CODE
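
A training step for this setup might look roughly like the sketch below. It assumes batches shaped like the `SequencePadder` output (padded ids, labels, lengths) and a model that returns a `(logits, hidden, weights)` triple; `train_one_epoch`, `TinyModel`, and the random batches are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, dataloader, criterion, optimizer, device):
    """Run one pass over the data and return the mean loss per batch."""
    model.train()
    total_loss = 0.0
    for inputs, labels, lengths in dataloader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        logits, _, _ = model(inputs, lengths)   # lengths stay on the CPU for packing
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / max(len(dataloader), 1)

class TinyModel(nn.Module):
    """Hypothetical stand-in model: mean-pools embeddings and ignores lengths."""

    def __init__(self, vocab_size=10, dim=8, num_classes=5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, X, lengths, hidden=None):
        pooled = self.embedding(X).mean(dim=0)   # X: (seq_len, batch)
        return self.fc(pooled), None, None

torch.manual_seed(0)
# Two random batches with the same layout as the collate function's output.
batches = [(torch.randint(0, 10, (6, 4)),
            torch.randint(0, 5, (4,)),
            torch.tensor([6, 5, 3, 2])) for _ in range(2)]

model = TinyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
criterion = nn.CrossEntropyLoss()
losses = [train_one_epoch(model, batches, criterion, optimizer, "cpu") for _ in range(20)]
print(f"first epoch: {losses[0]:.3f}, last epoch: {losses[-1]:.3f}")
```

A plain list of batches works here because the loop only needs iteration and `len()`; a real `DataLoader` provides both.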

In [None]:
### TODO: 2.2 Implement the evaluation functionality

### YOUR CODE HERE

# Named `evaluate` so that Python's built-in `eval` is not shadowed
def evaluate(model, dataloader, criterion, device, return_attn_dict=False):
    raise NotImplementedError()

### END YOUR CODE
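
Once predictions have been collected (typically inside `torch.no_grad()` with the model in `eval()` mode), they can be scored with `sklearn.metrics`. The sketch below uses made-up labels and predictions; macro-averaged F1 is included as one reasonable choice because the sentiment classes are likely imbalanced.

```python
import numpy as np
import sklearn.metrics as sklearn_metrics

# Made-up gold labels and predictions for the five sentiment classes.
y_true = np.array([0, 1, 2, 2, 3, 4, 2, 1])
y_pred = np.array([0, 2, 2, 2, 3, 3, 2, 1])

accuracy = sklearn_metrics.accuracy_score(y_true, y_pred)
macro_f1 = sklearn_metrics.f1_score(y_true, y_pred, average="macro", zero_division=0)

# Rows are true classes, columns are predicted classes.
confusion = sklearn_metrics.confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3, 4])

print(f"accuracy={accuracy:.2f}, macro F1={macro_f1:.2f}")
print(confusion)
```

The confusion matrix is often more informative than a single score here, since neighbouring classes such as "somewhat positive" and "positive" tend to be confused with each other.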

In [None]:
### TODO: 2.3 Initialize and train the RNN Classification Model for X epochs + Evaluation

# Training parameters
SEED = 42
EPOCHS = 10
BATCH_SIZE = 16

LEARNING_RATE = 0.0001

DEVICE = "cpu" # 'cpu', 'mps' or 'cuda'
LABEL_COL = "Sentiment"
PAD_IDX = glove.stoi[PAD_TOKEN]

### YOUR CODE HERE

### Notice: Data loading example
data_samples = train_dataframe.tokenized
data_labels = train_dataframe[LABEL_COL].values
dataset = RottenTomatoesDataset(data_samples, data_labels, glove=glove, unk=UNK_TOKEN)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, collate_fn=SequencePadder(PAD_IDX))

### END YOUR CODE
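
For the train-test split on `train.tsv`, a stratified split keeps the label proportions similar in both parts. The sketch below shows the idea on synthetic labels; the class counts, the 80/20 ratio, and the use of `train_test_split` are assumptions, and the resulting index arrays would then be used to select rows of the dataframe.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for the Sentiment column.
labels = np.array([0] * 10 + [1] * 20 + [2] * 50 + [3] * 15 + [4] * 5)
indices = np.arange(len(labels))

# stratify=labels keeps the class proportions similar in both splits.
train_idx, val_idx = train_test_split(
    indices, test_size=0.2, stratify=labels, random_state=42
)

print(len(train_idx), len(val_idx))
print(np.bincount(labels[val_idx]))
```

Note that phrases from the same sentence can land in both splits with this approach; splitting by `SentenceId` instead would be a stricter alternative.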

In [None]:
### TODO: 2.4 Visualize the attention weights

### YOUR CODE HERE



### END YOUR CODE
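
As a lightweight starting point for the visualization, attention weights can be rendered as text bars next to their tokens before moving on to heatmaps. The helper `format_attention`, the tokens, and the weights below are all made up for illustration.

```python
def format_attention(tokens, weights, width=20):
    """Return one line per token with a bar whose length is proportional to its weight."""
    max_weight = max(weights) if weights else 0.0
    lines = []
    for token, weight in zip(tokens, weights):
        bar_len = round(width * weight / max_weight) if max_weight > 0 else 0
        lines.append(f"{token:<12} {weight:.2f} {'#' * bar_len}")
    return lines

# Made-up example: weights as they might come out of a softmax over four tokens.
tokens = ["a", "truly", "wonderful", "film"]
weights = [0.05, 0.20, 0.60, 0.15]

lines = format_attention(tokens, weights)
for line in lines:
    print(line)
```

The bars are scaled relative to the largest weight in the phrase, so the most attended token always gets the full width.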