# Assignment 5: Neural Networks

---

## Task 2) RNN for Classification

The theses dataset also contains a type (diploma, bachelor, master) and a category (internal/external) for each thesis.
In this part, we want to classify whether a thesis is a bachelor or a master thesis, and whether it is internal or external.
Since PyTorch provides most of the building blocks out of the box, we want you to compare the following Recurrent Neural Network variants:
[RNN](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html), [GRU](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html), [LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html), and bidirectional [LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) (enabled via the `bidirectional` flag).
The basic setup, as well as some code and steps, can be reused from your solution for the language modeling task.
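
Because the four variants share nearly the same constructor signature, they can be swapped behind a single wrapper. The following is only a minimal sketch under stated assumptions, not the reference solution: the names `VARIANTS` and `VariantClassifier`, the toy dimensions, and the choice to classify from the final hidden state are all illustrative.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

# Hypothetical lookup: variant name -> (PyTorch module class, bidirectional flag).
VARIANTS = {
    "rnn": (nn.RNN, False),
    "gru": (nn.GRU, False),
    "lstm": (nn.LSTM, False),
    "bilstm": (nn.LSTM, True),
}


class VariantClassifier(nn.Module):
    """Embedding -> recurrent layer -> linear layer on the final hidden state."""

    def __init__(self, variant, vocab_size, embed_dim, hidden_dim, num_classes, pad_idx=0):
        super().__init__()
        rnn_cls, bidirectional = VARIANTS[variant]
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.rnn = rnn_cls(embed_dim, hidden_dim, batch_first=True, bidirectional=bidirectional)
        # A bidirectional layer yields one final state per direction.
        self.fc = nn.Linear(hidden_dim * (2 if bidirectional else 1), num_classes)

    def forward(self, token_ids, lengths):
        embedded = self.embedding(token_ids)
        # Packing lets the recurrent layer skip the padded positions.
        packed = pack_padded_sequence(
            embedded, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        _, hidden = self.rnn(packed)
        if isinstance(hidden, tuple):  # nn.LSTM returns (h_n, c_n)
            hidden = hidden[0]
        if self.rnn.bidirectional:
            # Last layer: forward state is hidden[-2], backward state is hidden[-1].
            final = torch.cat([hidden[-2], hidden[-1]], dim=1)
        else:
            final = hidden[-1]
        return self.fc(final)


# Toy batch: two sequences, the second one padded with index 0.
token_ids = torch.tensor([[4, 7, 9, 2], [5, 3, 0, 0]])
lengths = torch.tensor([4, 2])

logit_shapes = {}
for name in VARIANTS:
    model = VariantClassifier(name, vocab_size=10, embed_dim=8, hidden_dim=16, num_classes=2)
    logit_shapes[name] = tuple(model(token_ids, lengths).shape)
```

Every variant produces logits of shape `(batch, num_classes)`; only the input size of the linear layer changes when `bidirectional=True`.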

### Data

Download the `theses.csv` data set from the `Supplemental Materials` in the `Files` section of our Microsoft Teams group.
This dataset consists of approximately 3,000 thesis topics chosen by students in the past.
Here are some examples of the file content:

```
27.10.94;14.07.95;1995;intern;Diplom;DE;Monte Carlo-Simulation für ein gekoppeltes Round-Robin-System;
04.11.94;14.03.95;1995;intern;Diplom;DE;Implementierung eines Testüberdeckungsgrad-Analysators für RAS;
01.11.20;01.04.21;2021;intern;Bachelor;DE;Landessprachenerkennung mittels X-Vektoren und Meta-Klassifikation;
```
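
As a rough illustration, rows like the ones above can be parsed with Pandas by setting the separator to `;`. This sketch assumes the rows carry no header line and takes the column names from the CSV format listed in the data preparation step of this notebook; whether the real file starts with a header row is not shown here, so adjust `header` and `names` accordingly.

```python
from io import StringIO

import pandas as pd

# Column names as given by the CSV format in the data preparation step.
COLUMNS = [
    "Anmeldedatum", "Abgabedatum", "JahrAkademisch",
    "Art", "Grad", "Sprache", "Titel", "Abstract",
]

# The three example rows from above; the trailing ';' marks an empty abstract.
SAMPLE = """27.10.94;14.07.95;1995;intern;Diplom;DE;Monte Carlo-Simulation für ein gekoppeltes Round-Robin-System;
04.11.94;14.03.95;1995;intern;Diplom;DE;Implementierung eines Testüberdeckungsgrad-Analysators für RAS;
01.11.20;01.04.21;2021;intern;Bachelor;DE;Landessprachenerkennung mittels X-Vektoren und Meta-Klassifikation;
"""

# Assumption: no header row in the sample, so the names are passed explicitly.
sample_df = pd.read_csv(StringIO(SAMPLE), sep=";", names=COLUMNS, header=None)
```

For the real file, the same call should work with the path to `theses.csv` in place of the `StringIO` object.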

### Basic Setup

For the assignment on Recurrent Neural Networks, we'll (again) rely heavily on [PyTorch](https://pytorch.org) as our go-to deep learning library.
Here, we'll rely on the RNN and Embedding modules already implemented by PyTorch.
You can imagine the Embedding layer as a simple lookup table that stores embeddings of a fixed dictionary and size (quite similar to the Word2Vec parameters we've trained in assignment 2).
Head over to the [RNN](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) and [Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) modules to gain some understanding of their functionality.
Code for processing data samples, batching, converting to tensors, etc. can get messy and hard to maintain. 
Therefore, you can use PyTorch's [Datasets & DataLoaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html). 
Get familiar with the basics of data handling, as it will help you in upcoming assignments.
As always, you can use [NumPy](https://numpy.org) and [Pandas](https://pandas.pydata.org) for data handling etc.
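
To make the lookup-table picture concrete, the small sketch below (with arbitrary toy sizes, chosen only for illustration) shows that calling `nn.Embedding` on token indices returns the corresponding rows of its weight matrix, and how the result feeds into `nn.RNN`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A dictionary of 6 tokens, each mapped to a 4-dimensional vector.
embedding = nn.Embedding(num_embeddings=6, embedding_dim=4)

token_ids = torch.tensor([[1, 3, 5], [2, 2, 0]])  # shape: (batch=2, seq_len=3)
embedded = embedding(token_ids)                   # shape: (2, 3, 4)

# The lookup is plain row indexing into the weight matrix.
same_as_indexing = torch.equal(embedded, embedding.weight[token_ids])

# With batch_first=True the RNN expects (batch, seq_len, features).
rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
outputs, h_n = rnn(embedded)  # outputs: (2, 3, 8), h_n: (1, 2, 8)
```

For a single-layer, unidirectional RNN, the last time step of `outputs` equals `h_n[0]`, which is why the final hidden state is a common input for a classification layer.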

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [None]:
# Dependencies
import os
import tqdm
import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

### Prepare the Data

1.1 Spend some time preparing the dataset. It may be helpful to lower-case the data and to filter for German titles. The CSV file should have the following format:

```
Anmeldedatum;Abgabedatum;JahrAkademisch;Art;Grad;Sprache;Titel;Abstract
```

1.2 Create the vocabulary from the prepared dataset. You'll need it for the modeling part, e.g., for `nn.Embedding`.

1.3 Filter out all diploma theses; they might be too easy to spot because they only cover "old" topics.

1.4 Create a PyTorch Dataset class which handles your tokenized data with respect to input and (class) labels.
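
Steps 1.2 and 1.4 could look roughly like the sketch below. It is one possible layout, not the expected solution: the special tokens `<pad>` and `<unk>`, the helper names `build_vocab`, `TitleDataset`, and `collate`, and the toy tokens are all assumptions made for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def build_vocab(tokenized_titles):
    """Maps every token to an integer index; 0 and 1 are reserved for PAD and UNK."""
    word2idx = {PAD: 0, UNK: 1}
    for tokens in tokenized_titles:
        for token in tokens:
            word2idx.setdefault(token, len(word2idx))
    return word2idx


class TitleDataset(Dataset):
    """Stores index tensors for the titles and integer ids for the class labels."""

    def __init__(self, tokenized_titles, labels, word2idx, classes):
        unk = word2idx[UNK]
        self.data = [
            torch.tensor([word2idx.get(token, unk) for token in tokens])
            for tokens in tokenized_titles
        ]
        self.labels = [classes.index(label) for label in labels]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]


def collate(batch):
    """Pads variable-length samples to a common length and keeps the true lengths."""
    samples, labels = zip(*batch)
    lengths = torch.tensor([len(sample) for sample in samples])
    padded = pad_sequence(samples, batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)


# Toy data, not taken from the theses file.
titles = [["a", "b", "c"], ["b", "d"]]
labels = ["Bachelor", "Master"]

word2idx = build_vocab(titles)
dataset = TitleDataset(titles, labels, word2idx, classes=["Bachelor", "Master"])
loader = DataLoader(dataset, batch_size=2, shuffle=False, collate_fn=collate)
padded, lengths, label_ids = next(iter(loader))
```

Returning the lengths alongside the padded batch makes it possible to pack the sequences later, so the recurrent layer does not process padding.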

In [None]:
def load_theses_dataset(filepath):
    """Loads all theses instances and returns them as a dataframe."""
    ### YOUR CODE HERE
    
    raise NotImplementedError()
    
    ### END YOUR CODE

In [None]:
### Notice: Think about start and end of sentence tokens

def preprocess(dataframe):
    """Preprocesses and tokenizes the given theses titles for further use."""
    ### YOUR CODE HERE
    
    raise NotImplementedError()

    ### END YOUR CODE

In [None]:
# dataframe = load_theses_dataset(...)
# tokenized_data = preprocess(dataframe)
# vocabulary = ...
# word2idx = ...
# idx2word = ...

In [None]:
### TODO: 1.4 Implement the PyTorch theses dataset
### Notice: It is possible to solve the task without this class.
### Notice: However, with respect to DataLoaders it makes your life easier.

### YOUR CODE HERE

class ThesesClassificationDataset(Dataset):
    def __init__(self, dataset, classes, word2idx):
        # TODO
        self.data, self.labels = [], []


    def __len__(self):
        return len(self.data)


    def __getitem__(self, idx):
        # TODO
        sample = None
        labels = None
        return sample, labels
    
### END YOUR CODE

### Train and Evaluate

2.1 Implement the RNN for classification. To do so, subclass `nn.Module` and override the `forward` method.

2.2 Train and evaluate your models with 5-fold cross-validation. As in RNN-LM, you can either learn the embeddings from scratch or reuse the ones from word2vec.

2.3 Assemble a table with recall, precision, and F1 score for each of the mentioned RNN variants (RNN, GRU, LSTM, bidirectional LSTM). Which one works best?

2.4 Bonus: Apply your best classifier to the remaining diploma theses; are those on average more bachelor or master? :-)
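
For steps 2.2 and 2.3, the fold splitting and the metrics can be sketched with NumPy alone. The helper names `kfold_indices` and `precision_recall_f1` are hypothetical, and the labels in the usage lines are toy values; scikit-learn's `KFold` and `precision_recall_fscore_support` offer comparable functionality if you prefer a library.

```python
import numpy as np


def kfold_indices(n_samples, n_splits=5, seed=42):
    """Yields (train_idx, test_idx) pairs over a seeded random permutation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([fold for j, fold in enumerate(folds) if j != i])
        yield train_idx, test_idx


def precision_recall_f1(y_true, y_pred, positive):
    """Computes precision, recall, and F1 for the class given as `positive`."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Each sample appears in exactly one test fold.
splits = list(kfold_indices(n_samples=10))

# Toy labels: 2 true positives, 1 false positive, 1 false negative.
p, r, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1], positive=1)
```

Averaging the per-fold metrics for each variant then gives one table row per model.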

In [None]:
### TODO: 2.1 Implement the RNN classifier (nn.Module)
### Notice: Think about padding for batch sizes > 1
### Notice: 'torch.nn.utils.rnn' provides functionality

### YOUR CODE HERE

class RNN_Classifier(nn.Module):
    def __init__(self, arguments):
        super().__init__()
        # TODO

    
    def forward(self, X, hidden=None):
        # TODO
        raise NotImplementedError()

### END YOUR CODE

In [None]:
### TODO: 2.2 Implement the train functionality
### Notice: If you want, you can also combine train and eval functionality

def train(arguments):
    """Trains the RNN-Classifier for one epoch."""
    ### YOUR CODE HERE

    raise NotImplementedError()

    ### END YOUR CODE

In [None]:
### TODO: 2.2 Implement the evaluation functionality
### Notice: If you want, you can also combine train and eval

def evaluate(arguments):  # named `evaluate` to avoid shadowing Python's built-in eval()
    """Evaluates the optimized RNN-Classifier."""
    ### YOUR CODE HERE

    raise NotImplementedError()

    ### END YOUR CODE

In [None]:
### TODO: 2.2 Initialize and train the RNN-Classifier for X epochs

# For split reproducibility
# Use 5-fold cross validation
SEED = 42

EPOCHS = 25

DEVICE = "cpu" # 'cpu', 'mps' or 'cuda'

LABEL_COL = "Grad"

### YOUR CODE HERE

# Use batch_size=1 if you want to avoid padding handling
train_dataset = None
train_dataloader = None

# Use batch_size=1 if you want to avoid padding handling
test_dataset = None
test_dataloader = None

# Your RNN classifier
model = None

# Your loss function
criterion = None

# Your optimizer (optim.SGD should be okay)
optimizer = None


# TODO: Training for epoch i

# TODO: Evaluation for epoch i


### END YOUR CODE

In [None]:
### TODO: 2.3 Compare the results of various RNN variants (classification metrics)

### YOUR CODE HERE



### END YOUR CODE

In [None]:
### TODO: 2.4 (Optional) Apply your best classifier to the diploma theses

### YOUR CODE HERE



### END YOUR CODE