# Assignment 5: Neural Networks

---

## Task 1) RNN as Language Model

As with the n-gram language models in the previous tasks, imagine you have to write another thesis and want to generate an interesting topic.
In this assignment, you will train and use Recurrent Neural Networks as language models to generate new potential thesis topics.

### Data

Download the `theses.csv` dataset from the `Supplemental Materials` in the `Files` section of our Microsoft Teams group.
This dataset consists of approximately 3,000 thesis topics chosen by students in the past.
Here are some examples of the file content:

```
27.10.94;14.07.95;1995;intern;Diplom;DE;Monte Carlo-Simulation für ein gekoppeltes Round-Robin-System;
04.11.94;14.03.95;1995;intern;Diplom;DE;Implementierung eines Testüberdeckungsgrad-Analysators für RAS;
01.11.20;01.04.21;2021;intern;Bachelor;DE;Landessprachenerkennung mittels X-Vektoren und Meta-Klassifikation;
```
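
As a rough illustration of the file layout, the three sample lines above can be parsed as semicolon-separated values. This is only a sketch: the column names are taken from the header format listed under "Prepare the Data", and passing them explicitly with `header=None` assumes the input has no header row, which may not hold for the real `theses.csv`.

```python
import io

import pandas as pd

# Column names as listed in the CSV format description of this assignment.
COLUMNS = ["Anmeldedatum", "Abgabedatum", "JahrAkademisch", "Art",
           "Grad", "Sprache", "Titel", "Abstract"]

# The three sample lines shown above (no header row in this snippet).
sample = """27.10.94;14.07.95;1995;intern;Diplom;DE;Monte Carlo-Simulation für ein gekoppeltes Round-Robin-System;
04.11.94;14.03.95;1995;intern;Diplom;DE;Implementierung eines Testüberdeckungsgrad-Analysators für RAS;
01.11.20;01.04.21;2021;intern;Bachelor;DE;Landessprachenerkennung mittels X-Vektoren und Meta-Klassifikation;
"""

# Read everything as strings; the empty trailing field becomes a missing "Abstract".
df = pd.read_csv(io.StringIO(sample), sep=";", header=None, names=COLUMNS, dtype=str)

# Keep German titles only and lower-case them.
german_titles = df.loc[df["Sprache"] == "DE", "Titel"].str.lower().tolist()
print(german_titles)
```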

### Basic Setup

For the assignment on Recurrent Neural Networks, we'll (again) rely heavily on [PyTorch](https://pytorch.org) as our go-to deep learning library.
Here, we'll rely on the RNN and Embedding modules already implemented by PyTorch.
You can imagine the Embedding layer as a simple lookup table that stores embeddings of a fixed dictionary and size (quite similar to the Word2Vec parameters we've trained in assignment 2).
Head over to the [RNN](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) and [Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) modules to gain some understanding of their functionality.
Code for processing data samples, batching, converting to tensors, etc. can get messy and hard to maintain.
To keep it organized, you can use PyTorch's [Datasets & DataLoaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html).
Get familiar with the basics of data handling, as it will help you in upcoming assignments.
As always, you can use [NumPy](https://numpy.org) and [Pandas](https://pandas.pydata.org) for data handling etc.
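
To make the lookup-table view of the Embedding layer concrete, here is a minimal sketch of how `nn.Embedding` and `nn.RNN` fit together. The sizes and token indices are arbitrary toy values chosen for illustration, and `batch_first=True` is just one possible layout.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration.
vocab_size, embedding_dim, hidden_size = 10, 4, 8

embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
rnn = nn.RNN(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True)

# Two sequences of three token indices each: shape (batch=2, seq_len=3).
token_ids = torch.tensor([[1, 5, 2], [3, 3, 0]])

embedded = embedding(token_ids)       # shape (2, 3, 4)
outputs, last_hidden = rnn(embedded)  # shapes (2, 3, 8) and (1, 2, 8)

# The embedding of token 5 is simply row 5 of the weight matrix.
print(torch.equal(embedded[0, 1], embedding.weight[5]))
print(embedded.shape, outputs.shape, last_hidden.shape)
```

`outputs` holds the hidden state at every time step, which is what a language model typically projects onto the vocabulary, while `last_hidden` can be passed back in as the initial hidden state for the next call.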

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [None]:
# Dependencies
import os
import tqdm
import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

### Prepare the Data

1.1 Spend some time preparing the dataset. It may be helpful to lower-case the data and to filter for German titles. The format of the CSV file should be:

```
Anmeldedatum;Abgabedatum;JahrAkademisch;Art;Grad;Sprache;Titel;Abstract
```

1.2 Create the vocabulary from the prepared dataset. You'll need it for the modeling part, e.g., to determine the number of entries in `nn.Embedding`.

1.3 Create a PyTorch Dataset class which handles your tokenized data with respect to model inputs and labels.
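
One possible way to approach steps 1.2 and 1.3 is sketched below. The special tokens `<s>`, `</s>`, and `<unk>`, the whitespace tokenization, and the helper names `build_vocabulary` and `make_input_and_labels` are assumptions made for this example, not part of the provided template. The key idea is that, for a language model, the labels are the inputs shifted by one position.

```python
from collections import Counter

# Hypothetical special tokens; any consistent choice works.
SOS, EOS, UNK = "<s>", "</s>", "<unk>"


def build_vocabulary(tokenized_titles, min_count=1):
    """Builds vocabulary and index mappings from tokenized titles."""
    counts = Counter(tok for title in tokenized_titles for tok in title)
    # Reserve index 0 for unknown words; sort for a reproducible ordering.
    vocabulary = [UNK] + sorted(
        tok for tok, c in counts.items() if c >= min_count and tok != UNK
    )
    word2idx = {word: idx for idx, word in enumerate(vocabulary)}
    idx2word = {idx: word for word, idx in word2idx.items()}
    return vocabulary, word2idx, idx2word


def make_input_and_labels(tokens, word2idx):
    """Maps tokens to indices; labels are the inputs shifted by one step."""
    ids = [word2idx.get(tok, word2idx[UNK]) for tok in tokens]
    return ids[:-1], ids[1:]


titles = [
    "monte carlo-simulation für ein gekoppeltes round-robin-system",
    "implementierung eines testüberdeckungsgrad-analysators für ras",
]
# Simple whitespace tokenization with sentence boundary tokens.
tokenized = [[SOS] + title.split() + [EOS] for title in titles]

vocabulary, word2idx, idx2word = build_vocabulary(tokenized)
inputs, labels = make_input_and_labels(tokenized[0], word2idx)
print([idx2word[i] for i in inputs])
print([idx2word[i] for i in labels])
```

Inside a `Dataset`, `__getitem__` could return such an `(inputs, labels)` pair converted to `torch.long` tensors.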

In [None]:
def load_theses_dataset(filepath):
    """Loads all theses instances and returns them as a dataframe."""
    ### YOUR CODE HERE
    
    raise NotImplementedError()
    
    ### END YOUR CODE

In [None]:
### Notice: Think about start and end of sentence tokens

def preprocess(dataframe):
    """Preprocesses and tokenizes the given theses titles for further use."""
    ### YOUR CODE HERE
    
    raise NotImplementedError()

    ### END YOUR CODE

In [None]:
# dataframe = load_theses_dataset(...)
# tokenized_data = preprocess(dataframe)
# vocabulary = ...
# word2idx = ...
# idx2word = ...

In [None]:
### TODO: 1.3 Implement the PyTorch theses dataset
### Note: It is possible to solve the task without this class,
### but it makes working with DataLoaders much easier.

### YOUR CODE HERE

class ThesesDataset(Dataset):
    def __init__(self, dataset, word2idx):
        # TODO
        self.data, self.labels = [], []


    def __len__(self):
        return len(self.data)


    def __getitem__(self, idx):
        # TODO
        sample = None
        labels = None
        return sample, labels
    
### END YOUR CODE

### Train and Evaluate

2.1 Implement the RNN Language Model. To do so, subclass `nn.Module` and override the `forward` method. For the embedding layer, you can either reuse the embeddings learned in the previous Word2Vec assignment or train the `nn.Embedding` module and its parameters from scratch.

2.2 Implement the functionality to train your model with the train dataset.

2.3 Implement the functionality to evaluate your model with the test dataset.

2.4 Perform a train-test split of your theses data, train the RNN Language Model, and evaluate the loss & perplexity.
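
For the evaluation in 2.3 and 2.4, perplexity can be derived from the loss: it is the exponential of the mean per-token cross-entropy. The sketch below assumes the model outputs unnormalized logits and that a hypothetical padding index `PAD_IDX = 0` should be excluded from the average; neither is prescribed by the template.

```python
import math

import torch
import torch.nn.functional as F

PAD_IDX = 0  # hypothetical padding index, only relevant for batch_size > 1


def loss_and_perplexity(logits, targets, pad_idx=PAD_IDX):
    """Returns mean cross-entropy over non-padding tokens and its perplexity.

    logits:  (num_tokens, vocab_size) unnormalized scores
    targets: (num_tokens,) gold next-token indices
    """
    loss = F.cross_entropy(logits, targets, ignore_index=pad_idx)
    return loss.item(), math.exp(loss.item())


# Sanity check: a model that is uniform over 5 words has perplexity 5.
logits = torch.zeros(4, 5)
targets = torch.tensor([1, 2, 3, PAD_IDX])  # last position is padding
loss, perplexity = loss_and_perplexity(logits, targets)
print(loss, perplexity)
```

When averaging over a whole test set, accumulate the summed loss and the token count across batches before exponentiating, rather than averaging per-batch perplexities.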

In [None]:
### TODO: 2.1 Implement the RNN Language Model (nn.Module)

### YOUR CODE HERE

class RNN_LM(nn.Module):
    def __init__(self, arguments):
        super(RNN_LM, self).__init__()
        # TODO

    
    def forward(self, X, hidden=None):
        # TODO
        raise NotImplementedError()

### END YOUR CODE

In [None]:
### TODO: 2.2 Implement the train functionality
### Notice: If you want, you can also combine train and eval functionality

def train(arguments):
    """Trains the RNN-LM for one epoch."""
    ### YOUR CODE HERE

    raise NotImplementedError()

    ### END YOUR CODE

In [None]:
### TODO: 2.3 Implement the evaluation functionality
### Notice: If you want, you can also combine train and eval

def evaluate(arguments):
    """Evaluates the optimized RNN-LM."""
    ### YOUR CODE HERE

    raise NotImplementedError()

    ### END YOUR CODE

In [None]:
### TODO: 2.4 Initialize and train the RNN Language Model for X epochs

# For split reproducibility
# Optional: Use 5-fold cross validation
SEED = 42

EPOCHS = 100

DEVICE = "cpu" # 'cpu', 'mps' or 'cuda'

### YOUR CODE HERE

# Use batch_size=1 if you want to avoid padding handling
train_dataset = None
train_dataloader = None

# Use batch_size=1 if you want to avoid padding handling
test_dataset = None
test_dataloader = None

# Your language model
model = None

# Your loss function
criterion = None

# Your optimizer (optim.SGD should be okay)
optimizer = None


# TODO: Training for epoch i

# TODO: Evaluation for epoch i


### END YOUR CODE

### Generate Titles

3.1 Use the trained RNN Language Model to generate thesis titles. How can you sample the next tokens?

3.2 Compare your results with n-gram language models (e.g., n=4). You may use a library such as NLTK.
- What perplexity does a regular 4-gram have on the same split? 
- Compare the generated titles from the 4-gram and RNN-LM. Do you think the n-gram titles are better?
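
Regarding the sampling question in 3.1, two common options are greedy decoding (always pick the most likely token) and sampling from the softmax distribution, optionally sharpened or flattened by a temperature. The helper `sample_next_token` below is a hypothetical sketch of both, assuming the model returns a 1-D tensor of logits for the next position.

```python
import torch


def sample_next_token(logits, temperature=1.0, generator=None):
    """Picks the next token index from a 1-D tensor of logits.

    temperature <= 0 falls back to greedy decoding (argmax);
    otherwise a token is drawn from softmax(logits / temperature).
    """
    if temperature <= 0:
        return torch.argmax(logits).item()
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1, generator=generator).item()


logits = torch.tensor([0.0, 0.0, 10.0])
generator = torch.Generator().manual_seed(0)

greedy = sample_next_token(logits, temperature=0.0)
sampled = sample_next_token(logits, temperature=1.0, generator=generator)
print(greedy, sampled)
```

In a generation loop, you would start from the start-of-sentence token, feed the sampled token and the returned hidden state back into the model, and stop at the end-of-sentence token or a maximum length. Greedy decoding tends to produce the same title repeatedly, whereas sampling yields more variety.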

In [None]:
### TODO: 3.1 Generate titles with the trained RNN Language Model

def generate(arguments):
    ### YOUR CODE HERE

    raise NotImplementedError()

    ### END YOUR CODE

for i in range(10):
    generated_title = generate(None)
    print(" ".join(generated_title))

In [None]:
### TODO: 3.2 Generate titles with the trained n-gram language model

### YOUR CODE HERE



### END YOUR CODE