# Assignment 6: Attention (please!)

---

## Task 1) Thesis Title Classification

In this assignment, we again use the theses dataset to classify whether a thesis is a bachelor's or a master's thesis.
Update your B.Sc. / M.Sc. thesis title classification model from the previous assignment and integrate an attention mechanism.
To do so, implement `dot product attention` and check how it affects training and performance on this task.
In case you want to start fresh, we provide some boilerplate code for a base RNN classification model as well as ready-to-go data loading.
The basic setup as well as some code and steps can be reused from your solution for the RNN tasks.

### Data

Download the `theses.csv` data set from the `Supplemental Materials` in the `Files` section of our Microsoft Teams group.
This dataset consists of approx. 3,000 thesis topics chosen by students in the past.
Here are some examples of the file content:

```
27.10.94;14.07.95;1995;intern;Diplom;DE;Monte Carlo-Simulation für ein gekoppeltes Round-Robin-System;
04.11.94;14.03.95;1995;intern;Diplom;DE;Implementierung eines Testüberdeckungsgrad-Analysators für RAS;
01.11.20;01.04.21;2021;intern;Bachelor;DE;Landessprachenerkennung mittels X-Vektoren und Meta-Klassifikation;
```

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [None]:
# Dependencies
import os
import re
import tqdm
import string
import numpy as np
import pandas as pd
import sklearn.metrics as sklearn_metrics

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

### Prepare the Data

1.1 Spend some time preparing the dataset. It may be helpful to lower-case the titles and to keep only German titles. The format of the CSV file should be:

```
Anmeldedatum;Abgabedatum;JahrAkademisch;Art;Grad;Sprache;Titel;Abstract
```

1.2 Create the vocabulary from the prepared dataset. You'll need it for the modeling part, e.g., for `nn.Embedding`.

1.3 Filter out all diploma theses; they might be too easy to spot because they only cover "old" topics.

1.4 Create a PyTorch Dataset class which handles your tokenized data with respect to input and (class) labels.

In [None]:
def load_theses_dataset(filepath):
    """Loads all theses instances and returns them as a dataframe."""
    ### YOUR CODE HERE
    
    return pd.read_csv(filepath, header=0, sep=";")
    
    ### END YOUR CODE

In [None]:
def preprocess(dataframe):
    """Preprocesses and tokenizes the given theses titles for further use."""
    ### YOUR CODE HERE
    
    def _preprocess_fn(text):
        # Replace digits and punctuation with spaces, then collapse repeated spaces
        remove_digits = str.maketrans(string.digits, ' '*len(string.digits))
        remove_pun = str.maketrans(string.punctuation, ' '*len(string.punctuation))
        text = text.translate(remove_digits)
        text = text.translate(remove_pun)
        text = re.sub(' {2,}', ' ', text)
        return text.lower()
    
    dataframe = dataframe.copy()
    
    # Remove punctuation, digits and lowercase titles
    dataframe["Titel"] = dataframe["Titel"].apply(_preprocess_fn)

    # Filter out empty and short titles
    dataframe = dataframe[dataframe["Titel"].str.len() > 4]

    # Reset index of dataframe
    dataframe = dataframe.reset_index(drop=True)

    # Simple tokenization of titles
    dataframe["tokenized"] = [title.split() for title in dataframe["Titel"].values]

    return dataframe

    ### END YOUR CODE

In [None]:
# Load and preprocess dataset (adjust the path to where you stored the downloaded CSV file)
dataframe_all = load_theses_dataset("data/theses2022.csv")
dataframe_all = dataframe_all[dataframe_all["Sprache"] == "DE"]
dataframe_all = preprocess(dataframe_all)

# Convert labels to integer
LABEL2IDX = {"Bachelor": 0, "Master": 1, "Diplom": 2}
dataframe_all["label"] = dataframe_all["Grad"].apply(lambda l: LABEL2IDX[l])

# Filter out `Diplom`
dataframe_diplom = dataframe_all[dataframe_all["Grad"] == "Diplom"]
dataframe = dataframe_all[dataframe_all["Grad"] != "Diplom"]

# Check number of samples and label distribution
print(f"Num theses (overall): {len(dataframe_all)}")
print(f"Num theses (w/o diplom): {len(dataframe)}")
print(f"Num theses (diplom): {len(dataframe_diplom)}")
print()
print(dataframe_all["Grad"].value_counts())

In [None]:
### Notice: Think about padding tokens for batch sizes > 1

vocab = set()
vocab.add("<pad>")

# For a more realistic application, we have to deal with unknown tokens
# that were not present in the training corpus. However, for the sake
# of clarity, we add all possible tokens from our dataset.
# vocab.add("<unk>")

# Prepare vocabulary
for s in dataframe_all.tokenized:
    vocab.update(s)

vocab_size = len(vocab)

word2idx = {w: idx for (idx, w) in enumerate(sorted(vocab))}
idx2word = {idx: w for (idx, w) in enumerate(sorted(vocab))}

print(f"Vocabulary size: {vocab_size}")
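
The comments in the cell above mention unknown tokens. As a minimal sketch of how this could look if the vocabulary were built from the training titles only: the helper names `build_vocab` and `encode`, as well as the toy titles, are assumptions for illustration and not part of the provided code.

```python
def build_vocab(tokenized_titles, specials=("<pad>", "<unk>")):
    """Map every token seen in the given titles to an integer id."""
    vocab = set(specials)
    for tokens in tokenized_titles:
        vocab.update(tokens)
    # Sorting makes the mapping reproducible across runs
    return {w: idx for idx, w in enumerate(sorted(vocab))}


def encode(tokens, word2idx, unk_token="<unk>"):
    """Convert tokens to ids, falling back to the <unk> id for unseen words."""
    unk_idx = word2idx[unk_token]
    return [word2idx.get(w, unk_idx) for w in tokens]


# Toy data (hypothetical titles, already tokenized)
train_titles = [["analyse", "von", "sprachmodellen"], ["erkennung", "von", "sprache"]]
demo_word2idx = build_vocab(train_titles)

# "dialekten" was never seen, so it is mapped to the <unk> id
demo_ids = encode(["analyse", "von", "dialekten"], demo_word2idx)
```

With such a fallback, titles from a held-out test split no longer raise a `KeyError` during the lookup.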

In [None]:
### PyTorch dataset for our thesis classification task

class ThesisClassificationDataset(Dataset):
    def __init__(self, dataset, labels, word2idx):
        self.data, self.labels = [], []
        for tokens, label in zip(dataset, labels):
            # Create inputs; map tokens to ids
            self.data.append(torch.stack([
                torch.tensor(word2idx[w], dtype=torch.long) for w in tokens
            ]))

            # Create labels; already an integer
            self.labels.append(label)


    def __len__(self):
        return len(self.data)


    def __getitem__(self, idx):
        # Returns one input and label sample
        return self.data[idx], self.labels[idx]

### Train and Evaluate

2.1 Implement the dot product attention mechanism and integrate it into your RNN classification model.

2.2 Train and evaluate your models with a train-test split (or, optionally, 5-fold cross-validation).

2.3 Assemble a table with recall, precision, and F1 for RNN classification with and without attention. Do your results improve compared to your old model?

2.4 Can you find certain words that receive high attention weights regarding the decision?
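
For step 2.1, one possible sketch of dot product attention is shown below. It assumes the final hidden state serves as the query and the padded RNN outputs serve as keys and values, in the sequence-first layout that `pad_sequence` produces by default. The function name `dot_product_attention` and the `demo_*` tensors are illustrative assumptions, not a reference solution.

```python
import torch
import torch.nn.functional as F


def dot_product_attention(query, keys, lengths):
    """Attend over RNN outputs using one query vector per sequence.

    query:   (batch, hidden)           e.g. last hidden state of the GRU
    keys:    (seq_len, batch, hidden)  e.g. padded RNN outputs (sequence-first)
    lengths: (batch,)                  true sequence lengths
    """
    keys = keys.transpose(0, 1)                                  # (batch, seq_len, hidden)
    scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)      # (batch, seq_len)

    # Mask padded positions so that they receive zero attention weight
    positions = torch.arange(keys.size(1), device=keys.device).unsqueeze(0)
    mask = positions >= lengths.to(keys.device).unsqueeze(1)
    scores = scores.masked_fill(mask, float("-inf"))

    weights = F.softmax(scores, dim=1)                           # (batch, seq_len)
    context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)   # (batch, hidden)
    return context, weights


# Toy example: seq_len=5, batch=2, hidden=8; the second sequence has 2 padded steps
torch.manual_seed(0)
demo_keys = torch.randn(5, 2, 8)
demo_query = torch.randn(2, 8)
demo_lengths = torch.tensor([5, 3])
demo_context, demo_weights = dot_product_attention(demo_query, demo_keys, demo_lengths)
```

In a classifier like the one in this notebook, the padded outputs could be recovered with `pad_packed_sequence`, and `context` could replace the last hidden state as input to the linear layer whenever attention is enabled.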

In [None]:
### TODO: 2.1 Implement RNN classifier (nn.Module)
### Notice: Think about padding for batch sizes > 1
### Notice: 'torch.nn.utils.rnn' provides functionality
### Notice: Here you can integrate the attention mechanism

### YOUR CODE HERE

from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence, pad_sequence

class GRU_Classifier(nn.Module):
    def __init__(self, num_embeddings, embedding_dim, hidden_dim, num_classes,
                 with_attention=False):
        super(GRU_Classifier, self).__init__()
        self.with_attention = with_attention

        self.embedding = nn.Embedding(
            num_embeddings=num_embeddings, 
            embedding_dim=embedding_dim
        )

        self.rnn = nn.GRU(
            input_size=embedding_dim,
            hidden_size=hidden_dim,
            bidirectional=False,
            num_layers=1
        )

        self.fc = nn.Linear(hidden_dim, num_classes)

    
    def forward(self, X, lengths, hidden=None):
        embeddings = self.embedding(X)

        # Packed sequence helps avoid unnecessary computation on padding
        # (expects sequences sorted by decreasing length, as SequencePadder provides)
        packed_seq = pack_padded_sequence(embeddings, lengths)

        outputs, hidden_states = self.rnn(packed_seq, hidden)

        # If tuple (h_n, c_n) contains cell state c_n then select h_n
        if isinstance(hidden_states, tuple):
            hidden = hidden_states[0]
        else:
            hidden = hidden_states

        clf_input, weights = hidden, None

        # Apply classifier with hidden states
        logits = self.fc(clf_input.squeeze(0))

        return logits, hidden_states, weights


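class SequencePadder():
    """Collate function that pads a batch of variable-length sequences."""

    def __init__(self, symbol) -> None:
        # Index of the padding token, e.g. word2idx["<pad>"]
        self.symbol = symbol

    def __call__(self, batch):
        # Sort by decreasing length, as expected by pack_padded_sequence
        sorted_batch = sorted(batch, key=lambda x: x[0].size(0), reverse=True)
        sequences = [x[0] for x in sorted_batch]
        labels = [x[1] for x in sorted_batch]
        # Result has shape (max_len, batch_size), i.e. sequence-first
        padded = pad_sequence(sequences, padding_value=self.symbol)
        lengths = torch.LongTensor([len(x) for x in sequences])
        return padded, torch.LongTensor(labels), lengths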


### END YOUR CODE

In [None]:
### TODO: 2.2 Implement the train functionality

### YOUR CODE HERE

def train(model, dataloader, criterion, optimizer, device):
    raise NotImplementedError()

### END YOUR CODE

In [None]:
### TODO: 2.2 Implement the evaluation functionality

### YOUR CODE HERE

# Named `evaluate` to avoid shadowing Python's built-in `eval`
def evaluate(model, dataloader, criterion, device, return_attn_dict=False):
    raise NotImplementedError()

### END YOUR CODE
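
For the table in step 2.3, precision, recall, and F1 can be derived from the collected predictions. The following is a small plain-Python sketch that spells out the arithmetic; the function name `precision_recall_f1` and the toy labels are assumptions for illustration. `sklearn_metrics.precision_recall_fscore_support` should report the same per-class values.

```python
def precision_recall_f1(y_true, y_pred, positive_label):
    """Compute precision, recall and F1 for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)

    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example with labels 0 (Bachelor) and 1 (Master)
demo_true = [0, 0, 1, 1, 1, 0]
demo_pred = [0, 1, 1, 1, 0, 0]
demo_p, demo_r, demo_f1 = precision_recall_f1(demo_true, demo_pred, positive_label=1)
```

Here the Master class has two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.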

In [None]:
### TODO: 2.3 Initialize and train the RNN Classification Model for X epochs + Evaluation

# Training parameters
SEED = 42
EPOCHS = 10
BATCH_SIZE = 16

LEARNING_RATE = 0.0001

DEVICE = "cpu" # 'cpu', 'mps' or 'cuda'
LABEL_COL = "label"
PAD_IDX = word2idx["<pad>"]


### YOUR CODE HERE

### Notice: Data loading example
data_samples = dataframe.tokenized
data_labels = dataframe[LABEL_COL].values
dataset = ThesisClassificationDataset(data_samples, data_labels, word2idx=word2idx)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, collate_fn=SequencePadder(PAD_IDX))

### END YOUR CODE
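
The data loading example above uses the full dataset. For the train-test split in step 2.2, one option is to split the sample indices per label so that both parts keep a similar Bachelor/Master ratio. This is a minimal sketch; `stratified_split` and its parameters are hypothetical names, and the 20% test fraction is an arbitrary choice.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) with similar label ratios in both parts."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # local generator keeps the split reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Keep at least one test sample per class
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy example: 10 samples of class 0 and 5 samples of class 1
demo_labels = [0] * 10 + [1] * 5
demo_train_idx, demo_test_idx = stratified_split(demo_labels, test_fraction=0.2)
```

The resulting index lists could then be passed to `torch.utils.data.Subset` to obtain separate training and test datasets, each with its own `DataLoader`.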

In [None]:
### TODO: 2.4 Visualize the attention weights

### YOUR CODE HERE



### END YOUR CODE
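
For step 2.4, one simple way to look for words with high attention is to average the weight each word receives over all titles it occurs in. The sketch below assumes that the evaluation step has already collected, per title, the token list and the matching attention weights as plain Python lists; `mean_attention_per_word` and the toy values are hypothetical.

```python
from collections import defaultdict


def mean_attention_per_word(token_lists, weight_lists, min_count=1):
    """Rank words by their mean attention weight across all titles."""
    totals, counts = defaultdict(float), defaultdict(int)
    for tokens, weights in zip(token_lists, weight_lists):
        # zip stops at the shorter list, so weights of padded positions are ignored
        for token, weight in zip(tokens, weights):
            totals[token] += weight
            counts[token] += 1
    means = {w: totals[w] / counts[w] for w in totals if counts[w] >= min_count}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Toy example; the second weight list has one trailing padding position
demo_tokens = [["entwicklung", "eines", "frameworks"], ["analyse", "eines", "datensatzes"]]
demo_attn = [[0.6, 0.1, 0.3], [0.5, 0.1, 0.4, 0.0]]
demo_ranking = mean_attention_per_word(demo_tokens, demo_attn)
```

Since the weights of each title sum to one, words in short titles tend to receive larger weights, so raising `min_count` or comparing within titles of similar length may give a fairer picture.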